I'm working with a data set from a url, which I've messily formatted to the following:
url = r"https://psl.noaa.gov/enso/mei/data/meiv2.data"
r = requests.post(url) #read url
s = r.text #url data to string
df = pd.read_csv(io.StringIO(s), sep=',', header=None, quoting=csv.QUOTE_ALL) #make pd dataframe
df2 = df.drop(index=[0, 46, 47, 48, 49]) #drop the unnessicary data
df3 = df2[0].str.split('\s+', expand = True) #reformat to new df
df3.columns = ['Year', 'DJ', 'JF', 'FM', 'MA', 'AM', 'MJ', 'JJ', 'JA', 'AS', 'SO', 'ON', 'ND'] #name the columns
Printing df3 results in the dataframe: that looks like this
This is fine to look at, but I want to create a time-series plot of MEI from 1979-2023 including each month. That is to say, the X-axis from Dec-Jan 1979 thru Nov-Dec 2023 with MEI values on the Y-axis.
So far, I've tried transposing the dataframe, but ran into the same issues when formatting a plot.
The first thing I would do, to simplify your code, is point out that read_csv()
is capable of doing a bunch of the transformations you're doing, just by specifying options. For example, you can split fields by whitespace by including delim_whitespace=True
.
import pandas as pd
months = ['DJ', 'JF', 'FM', 'MA', 'AM', 'MJ', 'JJ', 'JA', 'AS', 'SO', 'ON', 'ND']
df = pd.read_csv(
"https://psl.noaa.gov/enso/mei/data/meiv2.data",
delim_whitespace=True,
names=['Year', *months],
skiprows=1,
skipfooter=4,
na_values=[-999],
engine='python'
)
The next step is to "unpivot" the data. I use a function called DataFrame.melt()
to do this. By setting value_vars=months
, I am telling it that each value in each row and column should become a new row, labeled by what row (year) and what column (month) it was in.
df = df.melt(id_vars=['Year'], var_name='month', value_vars=months)
This gives us a dataframe with year, seasonal code, and value:
Year month value
0 1979 DJ 0.47
1 1980 DJ 0.35
2 1981 DJ -0.33
Next, I map this month code into an integer to represent the month of the year. I turn DJ
into 1, JF
into 2, and so on. This step is a little scientifically iffy - it sounds like this data represents a span of time, and those spans of time overlap. You should ask someone who understands the dataset if what I'm doing here makes sense.
df['month'] = df['month'].map({m: i + 1 for i, m in enumerate(months)})
Next, I take the year column and month column and combine it into a single date column. This allows me to use Pandas to plot the data as a time-series.
df['date'] = pd.to_datetime(df['Year'].map(str) + '-' + df['month'].map(str))
This next step is purely aesthetic: I find it nicer to work with sorted dataframes.
df = df.sort_values('date')
Finally, after preparing the data in the correct format, you can make a plot with:
df.plot(x='date', y='value')
Output: