Search code examples
pandasmatplotlibtimeserieschartdatetimeindex

How to make a graph plotting monthly data over many years in pandas


  • I have 11 years worth of hourly ozone concentration data.
    • There are 11 csv files containing ozone concentrations at every hour of every day.
  • I was able to read all of the files in and convert the index from date to datetime.
  • For my graph:
    • I calculated the maximum daily 8-hour average and then averaged those values over each month.
  • My new dataframe (df3) has:
    • a datetime index, which consists of the last day of the month for each month of the year over the 12 years.
    • It also has a column including the average MDA8 values.
  • I want make 3 separate scatter plots for the months of April, May, and June. (x axis = year, y axis = average MDA8 for the month)
    • However, I am getting stuck on how to call these individual months and plot the yearly data.

Minimal sample

site,date,start_hour,value,variable,units,quality,prelim,name 
3135,2010-01-01,0,13.0,OZONE,Parts Per Billion ( ppb ),,,Calexico-Ethel Street
3135,2010-01-01,1,5.0,OZONE,Parts Per Billion ( ppb ),,,Calexico-Ethel Street
3135,2010-01-01,2,11.0,OZONE,Parts Per Billion ( ppb ),,,Calexico-Ethel Street
3135,2010-01-01,3,17.0,OZONE,Parts Per Billion ( ppb ),,,Calexico-Ethel Street
3135,2010-01-01,5,16.0,OZONE,Parts Per Billion ( ppb ),,,Calexico-Ethel Street 

Here's a link to find similar CSV data https://www.arb.ca.gov/aqmis2/aqdselect.php?tab=hourly

I've attached some code below:

import pandas as pd
import os
import glob
import matplotlib.pyplot as plt

path = "C:/Users/blah"
for f in glob.glob(os.path.join(path, "*.csv")):
    df = pd.read_csv(f, header = 0, index_col='date')
    df2 = df.dropna(axis = 0, how = "all", subset = ['start_hour', 'variable'], inplace = True) 
    df = df.iloc[0:]
    df.index = pd.to_datetime(df.index) #converting date to datetime
    df['start_hour'] = pd.to_timedelta(df['start_hour'], unit = 'h')
    df['datetime'] = df.index + df['start_hour']
    df.set_index('datetime', inplace = True)

    df2 = df.value.rolling('8H', min_periods = 6).mean() 
    df2.index -= pd.DateOffset(hours=3)
    df2 = df4.resample('D').max()
    df2.index.name = 'timestamp'

The problem occurs below:

    df3 = df2.groupby(pd.Grouper(freq = 'M')).mean()
    df4 = df3[df3.index.month.isin([4,5,6])]
    if df4 == True:
        plt.plot(df3.index, df3.values)
    print(df4)

whenever I do this, I get a message saying "ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()." When I try this code with df4.any() == True:, it plots all of the months except April-June and it plots all values in the same plot. I want different plots for each month.

I've also tried adding the the following and removing the previous if statement:

df5 = df4.index.year.isin([2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019])
    if df5.all() == True:
        plt.plot(df4.index, df4.values)

However, this gives me an image like:
Apr-Jun MDA8 values

Again, I want to make a separate scatterplot for each month, although this is closer to what I want. Any help would be appreciated, thanks.

EDIT In addition, I have 2020 data, which only extends to the month of July. I don't think this is going to affect my graph, but I just wanted to mention it. Ideally, I want it to look something like this, but a different point for each year and for the individual month of April

Scatterplot


Solution

    • df.index -= pd.DateOffset(hours=3) has been removed for being potentially problematic
      • The first hours of each month would be in the previous month
      • The first hours of each day would be in the previous day
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    from pathlib import Path
    from datetime import date
    from pandas.tseries.offsets import MonthEnd
    
    # set the path to the files
    p = Path('/PythonProjects/stack_overflow/data/ozone/')
    
    # list of files
    files = list(p.glob('OZONE*.csv'))
    
    # create a dataframe from the files - all years all data
    df = pd.concat([pd.read_csv(file) for file in files])
    
    # format the dataframe
    df.start_hour = pd.to_timedelta(df['start_hour'], unit = 'h')
    df.date = pd.to_datetime(df.date)
    df['datetime'] = df.date + df.start_hour
    df.drop(columns=['date', 'start_hour'], inplace=True)
    df['month'] = df.datetime.dt.month
    df['day'] = df.datetime.dt.day
    df['year'] = df.datetime.dt.year
    df = df[df.month.isin([4, 5, 6])].copy()  # filter the dataframe - only April, May, June
    df.set_index('datetime', inplace = True)
    
    # calculate the 8-hour rolling mean
    df['r_mean'] = df.value.rolling('8H', min_periods=6).mean()
    
    # determine max value per day
    r_mean_daily_max = df.groupby(['year', 'month', 'day'], as_index=False)['r_mean'].max()
    
    # calculate the mean from the daily max
    mda8 = r_mean_daily_max.groupby(['year', 'month'], as_index=False)['r_mean'].mean()
    
    # add a new datetime column with the date as the end of the month
    mda8['datetime'] = pd.to_datetime(mda8.year.astype(str) + mda8.month.astype(str), format='%Y%m') + MonthEnd(1)
    

    df.info() & .head() before any processing

    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 78204 entries, 0 to 4663
    Data columns (total 9 columns):
     #   Column      Non-Null Count  Dtype  
    ---  ------      --------------  -----  
     0   site        78204 non-null  int64  
     1   date        78204 non-null  object 
     2   start_hour  78204 non-null  int64  
     3   value       78204 non-null  float64
     4   variable    78204 non-null  object 
     5   units       78204 non-null  object 
     6   quality     4664 non-null   float64
     7   prelim      4664 non-null   object 
     8   name        78204 non-null  object 
    dtypes: float64(2), int64(2), object(5)
    memory usage: 6.0+ MB
    
       site        date  start_hour  value variable                      units  quality prelim                   name 
    0  3135  2011-01-01           0   14.0    OZONE  Parts Per Billion ( ppb )      NaN    NaN  Calexico-Ethel Street 
    1  3135  2011-01-01           1   11.0    OZONE  Parts Per Billion ( ppb )      NaN    NaN  Calexico-Ethel Street 
    2  3135  2011-01-01           2   22.0    OZONE  Parts Per Billion ( ppb )      NaN    NaN  Calexico-Ethel Street 
    3  3135  2011-01-01           3   25.0    OZONE  Parts Per Billion ( ppb )      NaN    NaN  Calexico-Ethel Street 
    4  3135  2011-01-01           5   22.0    OZONE  Parts Per Billion ( ppb )      NaN    NaN  Calexico-Ethel Street 
    

    df.info & .head() after processing

    <class 'pandas.core.frame.DataFrame'>
    DatetimeIndex: 20708 entries, 2011-04-01 00:00:00 to 2020-06-30 23:00:00
    Data columns (total 11 columns):
     #   Column    Non-Null Count  Dtype  
    ---  ------    --------------  -----  
     0   site      20708 non-null  int64  
     1   value     20708 non-null  float64
     2   variable  20708 non-null  object 
     3   units     20708 non-null  object 
     4   quality   2086 non-null   float64
     5   prelim    2086 non-null   object 
     6   name      20708 non-null  object 
     7   month     20708 non-null  int64  
     8   day       20708 non-null  int64  
     9   year      20708 non-null  int64  
     10  r_mean    20475 non-null  float64
    dtypes: float64(3), int64(4), object(4)
    memory usage: 1.9+ MB
    
                         site  value variable                      units  quality prelim                   name   month  day  year  r_mean
    datetime                                                                                                                              
    2011-04-01 00:00:00  3135   13.0    OZONE  Parts Per Billion ( ppb )      NaN    NaN  Calexico-Ethel Street       4    1  2011     NaN
    2011-04-01 01:00:00  3135   29.0    OZONE  Parts Per Billion ( ppb )      NaN    NaN  Calexico-Ethel Street       4    1  2011     NaN
    2011-04-01 02:00:00  3135   31.0    OZONE  Parts Per Billion ( ppb )      NaN    NaN  Calexico-Ethel Street       4    1  2011     NaN
    2011-04-01 03:00:00  3135   28.0    OZONE  Parts Per Billion ( ppb )      NaN    NaN  Calexico-Ethel Street       4    1  2011     NaN
    2011-04-01 05:00:00  3135   11.0    OZONE  Parts Per Billion ( ppb )      NaN    NaN  Calexico-Ethel Street       4    1  2011     NaN
    

    r_mean_daily_max.info() and .head()

    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 910 entries, 0 to 909
    Data columns (total 4 columns):
     #   Column  Non-Null Count  Dtype  
    ---  ------  --------------  -----  
     0   year    910 non-null    int64  
     1   month   910 non-null    int64  
     2   day     910 non-null    int64  
     3   r_mean  910 non-null    float64
    dtypes: float64(1), int64(3)
    memory usage: 35.5 KB
    
       year  month  day  r_mean
    0  2011      4    1  44.125
    1  2011      4    2  43.500
    2  2011      4    3  42.000
    3  2011      4    4  49.625
    4  2011      4    5  45.500
    

    mda8.info() & .head()

    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 30 entries, 0 to 29
    Data columns (total 4 columns):
     #   Column    Non-Null Count  Dtype         
    ---  ------    --------------  -----         
     0   year      30 non-null     int64         
     1   month     30 non-null     int64         
     2   r_mean    30 non-null     float64       
     3   datetime  30 non-null     datetime64[ns]
    dtypes: datetime64[ns](1), float64(1), int64(2)
    memory usage: 1.2 KB
    
       year  month     r_mean   datetime
    0  2011      4  49.808135 2011-04-30
    1  2011      5  55.225806 2011-05-31
    2  2011      6  58.162302 2011-06-30
    3  2012      4  45.865278 2012-04-30
    4  2012      5  61.061828 2012-05-31
    

    mda8

    plot 1

    sns.lineplot(mda8.datetime, mda8.r_mean, marker='o')
    plt.xlim(date(2011, 1, 1), date(2021, 1, 1))
    

    enter image description here

    plot 2

    # create color mapping based on all unique values of year
    years = mda8.year.unique()
    colors = sns.color_palette('husl', n_colors=len(years))  # get a number of colors
    cmap = dict(zip(years, colors))  # zip values to colors
    
    for g, d in mda8.groupby('year'):
        sns.lineplot(d.datetime, d.r_mean, marker='o', hue=g, palette=cmap)
        
    plt.xlim(date(2011, 1, 1), date(2021, 1, 1))
    plt.legend(bbox_to_anchor=(1.04,0.5), loc="center left", borderaxespad=0)
    

    enter image description here

    plot 3

    sns.barplot(x='month', y='r_mean', data=mda8, hue='year')
    plt.legend(bbox_to_anchor=(1.04,0.5), loc="center left", borderaxespad=0)
    plt.title('MDA8: April - June')
    plt.ylabel('mda8 (ppb)')
    plt.show()
    

    enter image description here

    plot 4

    for month in mda8.month.unique():
        data = mda8[mda8.month == month]  # filter and plot the data for a specific month
        plt.figure()  # create a new figure for each month
        sns.lineplot(data.datetime, data.r_mean, marker='o')
        plt.xlim(date(2011, 1, 1), date(2021, 1, 1))
        plt.title(f'Month: {month}')
        plt.ylabel('MDA8: PPB')
        plt.xlabel('Year')
    
    • There will be one plot per month enter image description here

    plot 5

    for month in mda8.month.unique():
        data = mda8[mda8.month == month]
        sns.lineplot(data.datetime, data.r_mean, marker='o', label=month)
        plt.legend(title='Month')
        plt.xlim(date(2011, 1, 1), date(2021, 1, 1))
        plt.ylabel('MDA8: PPB')
        plt.xlabel('Year')
    

    enter image description here


    • Addressing I want make 3 separate scatter plots for the months of April, May, and June.
    • The main issue is, the data can't be plotted with a datetime axis.
      • The objective is to plot each day on the axis, with each figure as a different month.

    Lineplot

    • It's kind of busy
    • A custom color map has been used because there aren't enough colors in the standard palette to give each year a unique color
    # create color mapping based on all unique values of year
    years = df.index.year.unique()
    colors = sns.color_palette('husl', n_colors=len(years))  # get a number of colors
    cmap = dict(zip(years, colors))  # zip values to colors
    
    for k, v in df.groupby('month'):  # group the dateframe by month
        plt.figure(figsize=(16, 10))
        for year in v.index.year.unique():  # withing the month plot each year
            data = v[v.index.year == year]
            sns.lineplot(data.index.day, data.r_mean, err_style=None, hue=year, palette=cmap)
        plt.xlim(0, 33)
        plt.xticks(range(1, 32))
        plt.title(f'Month: {k}')
        plt.xlabel('Day of Month')
        plt.legend(bbox_to_anchor=(1.04,0.5), loc="center left", borderaxespad=0)
    plt.show()
    
    • Here's April, the other two figures look similar to this

    enter image description here

    Barplot

    for k, v in df.groupby('month'):  # group the dateframe by month
        plt.figure(figsize=(10, 20))
    
        sns.barplot(x=v.r_mean, y=v.day, ci=None, orient='h', hue=v.index.year)
        plt.title(f'Month: {k}')
        plt.ylabel('Day of Month')
        plt.legend(bbox_to_anchor=(1.04,0.5), loc="center left", borderaxespad=0)
    plt.show()
    

    enter image description here