I have a large dataset (a subset is available at https://drive.google.com/open?id=1o7dEsRUYZYZ2-L9pd_WFnIX1n10hSA-f) indexed by timestamp (tstamp, e.g. 2010-01-01 00:00:00) with the rainfall in mm. Measurements are taken every 5 minutes over many years:
                      mm
tstamp
2010-01-01 00:00:00  0.0
2010-01-01 00:05:00  0.0
2010-01-01 00:10:00  0.0
2010-01-01 00:15:00  0.0
2010-01-01 00:20:00  0.0
........
What I want is the number of rainy days in each month of each year, ideally as a dataframe like the following:
tstamp   rainy  not rainy
2010-01     11         20
2010-02     20          8
......
2012-10     15         16
2012-11     30          0
What I'm able to obtain is a nested dict like d = {year: {month: {'rainy': 10, 'not_rainy': 20}, ...}, ...}, built with this small code snippet:
    from collections import defaultdict

    d = defaultdict(lambda: defaultdict(dict))
    for year in df.index.year.unique():
        for month in df.index.month.unique():
            try:
                # Daily rain totals for this year-month (partial-string indexing)
                a = df.loc['{}-{:02d}'.format(year, month), 'mm'].resample('D').sum()
            except KeyError:
                # No data for this year/month combination
                continue
            d[year][month]['rainy'] = int((a != 0).sum())
            d[year][month]['not_rainy'] = int((a == 0).sum())
But I think I'm missing an easier and more straightforward solution. Any suggestions?
One way is to do two groupby operations:
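    # True for each calendar day with at least one reading above 0 mm
    daily = df['mm'].gt(0).groupby(df.index.normalize()).any()
    # Count True (rainy) and False (not rainy) days per month, one column each
    monthly = (daily.groupby(daily.index.to_period('M'))
                    .value_counts()
                    .unstack(fill_value=0)
              )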
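To see the two-step groupby end to end, here is a minimal sketch on made-up data. The two-month sample, the rain timestamps, and the 0.4 mm values are assumptions for illustration only, not taken from the linked dataset. The reindex/rename step at the end is an optional addition that turns the True/False columns into the "rainy" / "not rainy" layout asked for in the question.

```python
import pandas as pd

# Hypothetical sample: 5-minute readings covering January and February 2010.
idx = pd.date_range("2010-01-01", "2010-02-28 23:55", freq="5min", name="tstamp")
df = pd.DataFrame({"mm": 0.0}, index=idx)

# Assumed rain events: three distinct days in January, one day in February.
rain_stamps = [
    "2010-01-03 10:00",
    "2010-01-03 10:05",  # same day as above, must not be counted twice
    "2010-01-15 08:30",
    "2010-01-20 23:55",
    "2010-02-10 12:00",
]
for stamp in rain_stamps:
    df.loc[pd.Timestamp(stamp), "mm"] = 0.4

# Step 1: one boolean per calendar day, True if any reading that day is > 0.
daily = df["mm"].gt(0).groupby(df.index.normalize()).any()

# Step 2: count True/False days per month and spread the counts into columns.
monthly = (
    daily.groupby(daily.index.to_period("M"))
    .value_counts()
    .unstack(fill_value=0)
    # Guarantee both columns exist even if every day is rainy (or none is).
    .reindex(columns=[True, False], fill_value=0)
    .rename(columns={True: "rainy", False: "not rainy"})
)

print(monthly)
```

With this sample, January ends up with 3 rainy and 28 non-rainy days, and February with 1 and 27. Note that only days present in the index are counted, so gaps in the measurements would reduce the "not rainy" totals rather than being reported as dry days.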