I am working with data from the California Air Resources Board.
site,monitor,date,start_hour,value,variable,units,quality,prelim,name
5407,t,2014-01-01,0,3.00,PM25HR,Micrograms/Cubic Meter ( ug/m<sup>3</sup> ),0,y,Bombay Beach
5407,t,2014-01-01,1,1.54,PM25HR,Micrograms/Cubic Meter ( ug/m<sup>3</sup> ),0,y,Bombay Beach
5407,t,2014-01-01,2,3.76,PM25HR,Micrograms/Cubic Meter ( ug/m<sup>3</sup> ),0,y,Bombay Beach
5407,t,2014-01-01,3,5.98,PM25HR,Micrograms/Cubic Meter ( ug/m<sup>3</sup> ),0,y,Bombay Beach
5407,t,2014-01-01,4,8.09,PM25HR,Micrograms/Cubic Meter ( ug/m<sup>3</sup> ),0,y,Bombay Beach
5407,t,2014-01-01,5,12.05,PM25HR,Micrograms/Cubic Meter ( ug/m<sup>3</sup> ),0,y,Bombay Beach
5407,t,2014-01-01,6,12.55,PM25HR,Micrograms/Cubic Meter ( ug/m<sup>3</sup> ),0,y,Bombay Beach
...
df = pd.concat([pd.read_csv(file, header = 0) for file in f]) #merges all files into one dataframe
df.dropna(axis = 0, how = "all", subset = ['start_hour', 'variable'],
inplace = True) #drops bottom columns without data in them, NaN
df.start_hour = pd.to_timedelta(df['start_hour'], unit = 'h')
df.date = pd.to_datetime(df.date)
df['datetime'] = df.date + df.start_hour
df.drop(columns=['date', 'start_hour'], inplace=True)
df['month'] = df.datetime.dt.month
df['day'] = df.datetime.dt.day
df['year'] = df.datetime.dt.year
df.set_index('datetime', inplace = True)
df = df.rename(columns={'value':'conc'})
I have multiple years of hourly PM2.5 concentration data and am trying to prepare graphs that show the average monthly concentration over many years (different graphs for each month). Here's an image of the graph I've created thus far. [![Bombay Beach][1]][1] However, I want to add error bars to the average concentration line but I am having issues when attempting to calculate the standard deviation. I've created a new dataframe d_avg
that includes the year, month, day, and average concentration of PM2.5; here's some of the data.
d_avg = df.groupby(['year', 'month', 'day'], as_index=False)['conc'].mean()
year month day conc
0 2014 1 1 9.644583
1 2014 1 2 4.945652
2 2014 1 3 4.345238
3 2014 1 4 5.047917
4 2014 1 5 5.212857
5 2014 1 6 2.095714
After this, I found the monthly average m_avg
and created a datetime index to plot datetime vs monthly avg conc (refer above, black line).
m_avg = d_avg.groupby(['year','month'], as_index=False)['conc'].mean()
m_avg['datetime'] = pd.to_datetime(m_avg.year.astype(str) + m_avg.month.astype(str), format='%Y%m') + MonthEnd(1)
[In]: m_avg.head(6)
[Out]:
year month conc datetime
0 2014 1 4.330985 2014-01-31
1 2014 2 2.280096 2014-02-28
2 2014 3 4.464622 2014-03-31
3 2014 4 6.583759 2014-04-30
4 2014 5 9.069353 2014-05-31
5 2014 6 9.982330 2014-06-30
Now I want to calculate the standard deviation of the d_avg
concentration, and I've tried multiple things:
sd = d_avg.groupby(['year', 'month'], as_index=False)['conc'].std()
sd = d_avg.groupby(['year', 'month'], as_index=False)['conc'].agg(np.std)
sd = d_avg['conc'].apply(lambda x: x.std())
However, each attempt has left me with the same error in the dataframe. I am unable to plot the standard deviation because I believe it is taking the standard deviation of the year and month too, which I am trying to group the data by. Here's what my resulting dataframe sd
looks like:
year month sd
0 44.877611 1.000000 1.795868
1 44.877611 1.414214 2.355055
2 44.877611 1.732051 2.597531
3 44.877611 2.000000 2.538749
4 44.877611 2.236068 5.456785
5 44.877611 2.449490 3.315546
Please help me! [1]: https://i.sstatic.net/ueVrG.png
I decided to dance around my issue since I couldn't figure out what was causing the problem. I merged the m_avg and sd dataframes and dropped the year and month columns that were causing me issues. See code below, lots of renaming.
d_avg = df.groupby(['year', 'month', 'day'], as_index=False)['conc'].mean()
m_avg = d_avg.groupby(['year','month'], as_index=False)['conc'].mean()
sd = d_avg.groupby(['year', 'month'], as_index=False)['conc'].std(ddof=0)
sd = sd.rename(columns={"conc":"sd", "year":"wrongyr", "month":"wrongmth"})
m_avg_sd = pd.concat([m_avg, sd], axis = 1)
m_avg_sd.drop(columns=['wrongyr', 'wrongmth'], inplace = True)
m_avg_sd['datetime'] = pd.to_datetime(m_avg_sd.year.astype(str) + m_avg_sd.month.astype(str), format='%Y%m') + MonthEnd(1)
and here's the new dataframe:
m_avg_sd.head(5)
Out[2]:
year month conc sd datetime
0 2009 1 48.350105 18.394192 2009-01-31
1 2009 2 21.929383 16.293645 2009-02-28
2 2009 3 15.094729 6.821124 2009-03-31
3 2009 4 12.021009 4.391219 2009-04-30
4 2009 5 13.449100 4.081734 2009-05-31