site,date,start_hour,value,variable,units,quality,prelim,name
3135,2010-01-01,0,13.0,OZONE,Parts Per Billion ( ppb ),,,Calexico-Ethel Street
3135,2010-01-01,1,5.0,OZONE,Parts Per Billion ( ppb ),,,Calexico-Ethel Street
3135,2010-01-01,2,11.0,OZONE,Parts Per Billion ( ppb ),,,Calexico-Ethel Street
3135,2010-01-01,3,17.0,OZONE,Parts Per Billion ( ppb ),,,Calexico-Ethel Street
3135,2010-01-01,5,16.0,OZONE,Parts Per Billion ( ppb ),,,Calexico-Ethel Street
Here's a link to find similar CSV data https://www.arb.ca.gov/aqmis2/aqdselect.php?tab=hourly
I've attached some code below:
import pandas as pd
import os
import glob
import matplotlib.pyplot as plt
path = "C:/Users/blah"
for f in glob.glob(os.path.join(path, "*.csv")):
df = pd.read_csv(f, header = 0, index_col='date')
df2 = df.dropna(axis = 0, how = "all", subset = ['start_hour', 'variable'], inplace = True)
df = df.iloc[0:]
df.index = pd.to_datetime(df.index) #converting date to datetime
df['start_hour'] = pd.to_timedelta(df['start_hour'], unit = 'h')
df['datetime'] = df.index + df['start_hour']
df.set_index('datetime', inplace = True)
df2 = df.value.rolling('8H', min_periods = 6).mean()
df2.index -= pd.DateOffset(hours=3)
df2 = df4.resample('D').max()
df2.index.name = 'timestamp'
The problem occurs below:
df3 = df2.groupby(pd.Grouper(freq = 'M')).mean()
df4 = df3[df3.index.month.isin([4,5,6])]
if df4 == True:
plt.plot(df3.index, df3.values)
print(df4)
whenever I do this, I get a message saying "ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()."
When I try this code with df4.any() == True:
, it plots all of the months except April-June and it plots all values in the same plot. I want different plots for each month.
I've also tried adding the the following and removing the previous if statement:
df5 = df4.index.year.isin([2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019])
if df5.all() == True:
plt.plot(df4.index, df4.values)
However, this gives me an image like:
Again, I want to make a separate scatterplot for each month, although this is closer to what I want. Any help would be appreciated, thanks.
EDIT In addition, I have 2020 data, which only extends to the month of July. I don't think this is going to affect my graph, but I just wanted to mention it. Ideally, I want it to look something like this, but a different point for each year and for the individual month of April
df.index -= pd.DateOffset(hours=3)
has been removed for being potentially problematic
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
from datetime import date
from pandas.tseries.offsets import MonthEnd
# set the path to the files
p = Path('/PythonProjects/stack_overflow/data/ozone/')
# list of files
files = list(p.glob('OZONE*.csv'))
# create a dataframe from the files - all years all data
df = pd.concat([pd.read_csv(file) for file in files])
# format the dataframe
df.start_hour = pd.to_timedelta(df['start_hour'], unit = 'h')
df.date = pd.to_datetime(df.date)
df['datetime'] = df.date + df.start_hour
df.drop(columns=['date', 'start_hour'], inplace=True)
df['month'] = df.datetime.dt.month
df['day'] = df.datetime.dt.day
df['year'] = df.datetime.dt.year
df = df[df.month.isin([4, 5, 6])].copy() # filter the dataframe - only April, May, June
df.set_index('datetime', inplace = True)
# calculate the 8-hour rolling mean
df['r_mean'] = df.value.rolling('8H', min_periods=6).mean()
# determine max value per day
r_mean_daily_max = df.groupby(['year', 'month', 'day'], as_index=False)['r_mean'].max()
# calculate the mean from the daily max
mda8 = r_mean_daily_max.groupby(['year', 'month'], as_index=False)['r_mean'].mean()
# add a new datetime column with the date as the end of the month
mda8['datetime'] = pd.to_datetime(mda8.year.astype(str) + mda8.month.astype(str), format='%Y%m') + MonthEnd(1)
df.info()
& .head()
before any processing<class 'pandas.core.frame.DataFrame'>
Int64Index: 78204 entries, 0 to 4663
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 site 78204 non-null int64
1 date 78204 non-null object
2 start_hour 78204 non-null int64
3 value 78204 non-null float64
4 variable 78204 non-null object
5 units 78204 non-null object
6 quality 4664 non-null float64
7 prelim 4664 non-null object
8 name 78204 non-null object
dtypes: float64(2), int64(2), object(5)
memory usage: 6.0+ MB
site date start_hour value variable units quality prelim name
0 3135 2011-01-01 0 14.0 OZONE Parts Per Billion ( ppb ) NaN NaN Calexico-Ethel Street
1 3135 2011-01-01 1 11.0 OZONE Parts Per Billion ( ppb ) NaN NaN Calexico-Ethel Street
2 3135 2011-01-01 2 22.0 OZONE Parts Per Billion ( ppb ) NaN NaN Calexico-Ethel Street
3 3135 2011-01-01 3 25.0 OZONE Parts Per Billion ( ppb ) NaN NaN Calexico-Ethel Street
4 3135 2011-01-01 5 22.0 OZONE Parts Per Billion ( ppb ) NaN NaN Calexico-Ethel Street
df.info
& .head()
after processing<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 20708 entries, 2011-04-01 00:00:00 to 2020-06-30 23:00:00
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 site 20708 non-null int64
1 value 20708 non-null float64
2 variable 20708 non-null object
3 units 20708 non-null object
4 quality 2086 non-null float64
5 prelim 2086 non-null object
6 name 20708 non-null object
7 month 20708 non-null int64
8 day 20708 non-null int64
9 year 20708 non-null int64
10 r_mean 20475 non-null float64
dtypes: float64(3), int64(4), object(4)
memory usage: 1.9+ MB
site value variable units quality prelim name month day year r_mean
datetime
2011-04-01 00:00:00 3135 13.0 OZONE Parts Per Billion ( ppb ) NaN NaN Calexico-Ethel Street 4 1 2011 NaN
2011-04-01 01:00:00 3135 29.0 OZONE Parts Per Billion ( ppb ) NaN NaN Calexico-Ethel Street 4 1 2011 NaN
2011-04-01 02:00:00 3135 31.0 OZONE Parts Per Billion ( ppb ) NaN NaN Calexico-Ethel Street 4 1 2011 NaN
2011-04-01 03:00:00 3135 28.0 OZONE Parts Per Billion ( ppb ) NaN NaN Calexico-Ethel Street 4 1 2011 NaN
2011-04-01 05:00:00 3135 11.0 OZONE Parts Per Billion ( ppb ) NaN NaN Calexico-Ethel Street 4 1 2011 NaN
r_mean_daily_max.info()
and .head()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 910 entries, 0 to 909
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 year 910 non-null int64
1 month 910 non-null int64
2 day 910 non-null int64
3 r_mean 910 non-null float64
dtypes: float64(1), int64(3)
memory usage: 35.5 KB
year month day r_mean
0 2011 4 1 44.125
1 2011 4 2 43.500
2 2011 4 3 42.000
3 2011 4 4 49.625
4 2011 4 5 45.500
mda8.info()
& .head()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 30 entries, 0 to 29
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 year 30 non-null int64
1 month 30 non-null int64
2 r_mean 30 non-null float64
3 datetime 30 non-null datetime64[ns]
dtypes: datetime64[ns](1), float64(1), int64(2)
memory usage: 1.2 KB
year month r_mean datetime
0 2011 4 49.808135 2011-04-30
1 2011 5 55.225806 2011-05-31
2 2011 6 58.162302 2011-06-30
3 2012 4 45.865278 2012-04-30
4 2012 5 61.061828 2012-05-31
sns.lineplot(mda8.datetime, mda8.r_mean, marker='o')
plt.xlim(date(2011, 1, 1), date(2021, 1, 1))
# create color mapping based on all unique values of year
years = mda8.year.unique()
colors = sns.color_palette('husl', n_colors=len(years)) # get a number of colors
cmap = dict(zip(years, colors)) # zip values to colors
for g, d in mda8.groupby('year'):
sns.lineplot(d.datetime, d.r_mean, marker='o', hue=g, palette=cmap)
plt.xlim(date(2011, 1, 1), date(2021, 1, 1))
plt.legend(bbox_to_anchor=(1.04,0.5), loc="center left", borderaxespad=0)
sns.barplot(x='month', y='r_mean', data=mda8, hue='year')
plt.legend(bbox_to_anchor=(1.04,0.5), loc="center left", borderaxespad=0)
plt.title('MDA8: April - June')
plt.ylabel('mda8 (ppb)')
plt.show()
for month in mda8.month.unique():
data = mda8[mda8.month == month] # filter and plot the data for a specific month
plt.figure() # create a new figure for each month
sns.lineplot(data.datetime, data.r_mean, marker='o')
plt.xlim(date(2011, 1, 1), date(2021, 1, 1))
plt.title(f'Month: {month}')
plt.ylabel('MDA8: PPB')
plt.xlabel('Year')
for month in mda8.month.unique():
data = mda8[mda8.month == month]
sns.lineplot(data.datetime, data.r_mean, marker='o', label=month)
plt.legend(title='Month')
plt.xlim(date(2011, 1, 1), date(2021, 1, 1))
plt.ylabel('MDA8: PPB')
plt.xlabel('Year')
# create color mapping based on all unique values of year
years = df.index.year.unique()
colors = sns.color_palette('husl', n_colors=len(years)) # get a number of colors
cmap = dict(zip(years, colors)) # zip values to colors
for k, v in df.groupby('month'): # group the dateframe by month
plt.figure(figsize=(16, 10))
for year in v.index.year.unique(): # withing the month plot each year
data = v[v.index.year == year]
sns.lineplot(data.index.day, data.r_mean, err_style=None, hue=year, palette=cmap)
plt.xlim(0, 33)
plt.xticks(range(1, 32))
plt.title(f'Month: {k}')
plt.xlabel('Day of Month')
plt.legend(bbox_to_anchor=(1.04,0.5), loc="center left", borderaxespad=0)
plt.show()
for k, v in df.groupby('month'): # group the dateframe by month
plt.figure(figsize=(10, 20))
sns.barplot(x=v.r_mean, y=v.day, ci=None, orient='h', hue=v.index.year)
plt.title(f'Month: {k}')
plt.ylabel('Day of Month')
plt.legend(bbox_to_anchor=(1.04,0.5), loc="center left", borderaxespad=0)
plt.show()