python pandas resampling

Pandas: count monthly rainy vs. not rainy days starting from hourly data


I have a large dataset (a subset is available at https://drive.google.com/open?id=1o7dEsRUYZYZ2-L9pd_WFnIX1n10hSA-f) indexed by timestamp (tstamp, e.g. 2010-01-01 00:00:00) with a single column, mm, holding the rainfall in millimetres. Measurements are taken every 5 minutes over many years:

                      mm
tstamp                  
2010-01-01 00:00:00  0.0
2010-01-01 00:05:00  0.0
2010-01-01 00:10:00  0.0
2010-01-01 00:15:00  0.0
2010-01-01 00:20:00  0.0
........

What I want is the count of rainy and not rainy days for each month of each year, ideally as a DataFrame like the following:

tstamp    rainy  not rainy
2010-01   11     20
2010-02   20     8
......
2012-10   15     16
2012-11   30     0

What I'm able to obtain is a nested dict object like d = {year: {month: {'rainy': 10, 'not_rainy': 20}, ...}, ...}, built with this small code snippet:

from collections import defaultdict

d = defaultdict(lambda: defaultdict(dict))

for year in df.index.year.unique():
    for month in df.index.month.unique():
        try:
            # daily rainfall totals for this year-month
            a = df.loc['{}-{}'.format(year, month)].resample('D').sum()
        except KeyError:
            # no data for this month in this year
            continue

        # number of days with a non-zero / zero daily total
        d[year][month]['rainy'] = int((a['mm'] != 0).sum())
        d[year][month]['not_rainy'] = int((a['mm'] == 0).sum())

But I think I'm missing an easier, more straightforward solution. Any suggestions?


Solution

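  • One way is to use two groupby operations: the first flags each calendar day as rainy or not, the second counts those flags per month: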

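    # True for each calendar day with at least one reading above 0 mm
    daily = df['mm'].gt(0).groupby(df.index.normalize()).any()
    # count rainy (True) and dry (False) days within each month
    monthly = (daily.groupby(daily.index.to_period('M'))
                    .value_counts()
                    .unstack(fill_value=0)   # months missing one category get 0, not NaN
                    .rename(columns={True: 'rainy', False: 'not rainy'})
              )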
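
To try the idea end to end, here is a minimal, self-contained sketch on synthetic data. The sample values and the names daily_total, rainy_flag and by_month are made up for illustration and do not come from the linked dataset. It uses resample for the daily step instead of the first groupby, which should give the same kind of result as the approach above. Note that a day counts as rainy here when its daily total is above 0 mm.

```python
import numpy as np
import pandas as pd

# Synthetic 5-minute readings for two months (made-up values, illustration only)
idx = pd.date_range("2010-01-01", "2010-02-28 23:55", freq="5min", name="tstamp")
df = pd.DataFrame({"mm": 0.0}, index=idx)
df.loc["2010-01-03 10:00":"2010-01-03 11:00", "mm"] = 0.2   # rain on 3 Jan
df.loc["2010-01-15 00:00":"2010-01-16 06:00", "mm"] = 0.1   # rain on 15 and 16 Jan
df.loc[pd.Timestamp("2010-02-10 18:30"), "mm"] = 1.5        # rain on 10 Feb

# Daily totals; min_count=1 leaves days without any reading as NaN,
# so dropna() keeps only days that were actually measured
daily_total = df["mm"].resample("D").sum(min_count=1).dropna()

# 1 for a rainy day, 0 for a dry one
rainy_flag = daily_total.gt(0).astype(int)

# Group the daily flags by calendar month and count each kind of day
by_month = rainy_flag.groupby(rainy_flag.index.to_period("M"))
monthly = pd.DataFrame({
    "rainy": by_month.sum(),
    "not rainy": by_month.size() - by_month.sum(),
})
print(monthly)
```

For this sample, January ends up with 3 rainy and 28 not rainy days, and February with 1 and 27. Building the two columns explicitly avoids NaN in months where every day falls into a single category.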