Search code examples
pythondataframedatetime-series

How do I format Year-Month data with row = 'year' and columns = 'months'


I'm working with a data set from a url, which I've messily formatted to the following:

url = r"https://psl.noaa.gov/enso/mei/data/meiv2.data"
r = requests.post(url) #read url
s = r.text #url data to string
df = pd.read_csv(io.StringIO(s), sep=',', header=None, quoting=csv.QUOTE_ALL) #make pd dataframe
df2 = df.drop(index=[0, 46, 47, 48, 49]) #drop the unnessicary data
df3 = df2[0].str.split('\s+', expand = True) #reformat to new df
df3.columns = ['Year', 'DJ', 'JF', 'FM', 'MA', 'AM', 'MJ', 'JJ', 'JA', 'AS', 'SO', 'ON', 'ND'] #name the columns

Printing df3 results in the dataframe: that looks like this

This is fine to look at, but I want to create a time-series plot of MEI from 1979-2023 including each month. That is to say, the X-axis from Dec-Jan 1979 thru Nov-Dec 2023 with MEI values on the Y-axis.

So far, I've tried transposing the dataframe, but ran into the same issues when formatting a plot.


Solution

  • The first thing I would do, to simplify your code, is point out that read_csv() is capable of doing a bunch of the transformations you're doing, just by specifying options. For example, you can split fields by whitespace by including delim_whitespace=True.

    import pandas as pd
    
    months = ['DJ', 'JF', 'FM', 'MA', 'AM', 'MJ', 'JJ', 'JA', 'AS', 'SO', 'ON', 'ND']
    
    df = pd.read_csv(
        "https://psl.noaa.gov/enso/mei/data/meiv2.data",
        delim_whitespace=True,
        names=['Year', *months],
        skiprows=1,
        skipfooter=4,
        na_values=[-999],
        engine='python'
    )
    

    The next step is to "unpivot" the data. I use a function called DataFrame.melt() to do this. By setting value_vars=months, I am telling it that each value in each row and column should become a new row, labeled by what row (year) and what column (month) it was in.

    df = df.melt(id_vars=['Year'], var_name='month', value_vars=months)
    

    This gives us a dataframe with year, seasonal code, and value:

       Year month  value
    0  1979    DJ   0.47
    1  1980    DJ   0.35
    2  1981    DJ  -0.33
    

    Next, I map this month code into an integer to represent the month of the year. I turn DJ into 1, JF into 2, and so on. This step is a little scientifically iffy - it sounds like this data represents a span of time, and those spans of time overlap. You should ask someone who understands the dataset if what I'm doing here makes sense.

    df['month'] = df['month'].map({m: i + 1 for i, m in enumerate(months)})
    

    Next, I take the year column and month column and combine it into a single date column. This allows me to use Pandas to plot the data as a time-series.

    df['date'] = pd.to_datetime(df['Year'].map(str) + '-' + df['month'].map(str))
    

    This next step is purely aesthetic: I find it nicer to work with sorted dataframes.

    df = df.sort_values('date')
    

    Finally, after preparing the data in the correct format, you can make a plot with:

    df.plot(x='date', y='value')
    

    Output:

    plot of mei data