python, pandas, probability, prediction

Probabilistic prediction based on occurrence frequency


I have an hourly rainfall time series for 2011-2013, recorded daily from 10 am to 3 pm, where rainfall is encoded as 1 (no rain) or 0 (rain). I don't want to predict the rainfall itself for 2014; I want to predict the chance of rain for the whole year, for the same time intervals, based on how often 1 or 0 occurs in the rainfall column. Currently I estimate the chance of rain by counting the occurrences of 1 and 0 with the following code:

import pandas as pd
 
b = {'year':[2011,2011,2011,2011,2011,2011,2011,2011,2011,2011,2011,2011,
             2012,2012,2012,2012,2012,2012,2012,2012,2012,2012,2012,2012,
             2013,2013,2013,2013,2013,2013,2013,2013,2013,2013,2013,2013],
     'month': [1,2,3,4,5,6,7,8,9,10,11,12,1,2,3,4,5,6,7,8,9,10,11,12,1,2,3,4,5,6,7,8,9,10,11,12],
     'rain':[1,0,0,0,1,1,0,1,1,0,0,1,0,0,1,0,0,0,1,1,1,1,1,0,0,1,1,0,1,0,1,0,1,0,1,0]}

b = pd.DataFrame(b,columns = ['year','month','rain'])

def X(b):
    if (b['month'] == 1):
        return 'Jan'
    elif (b['month']==2):
        return 'Feb'
    elif (b['month']==3):
        return 'Mar'
    elif (b['month']==4):
        return 'Apr'
    elif (b['month']==5):
        return 'May'
    elif (b['month']==6):
        return 'Jun'
    elif (b['month']==7):
        return 'Jul'
    elif (b['month']==8):
        return 'Aug'
    elif (b['month']==9):
        return 'Sep'
    elif (b['month']==10):
        return 'Oct'
    elif (b['month']==11):
        return 'Nov'
    elif (b['month']==12):
        return 'Dec' 

b['X'] = b.apply(X,axis =1)

mask_x = (b['X']=='Jul')

mask_y = b['rain'].loc[mask_x]

mask_y.value_counts()

I think this method would not work for large datasets. Can someone suggest an efficient and robust way to predict the chance of rain from this kind of dataset?


Solution

  • The sample data is created by randomly drawing 0 or 1 for every hour. Grouping by the month, day, and hour of the date column gives, for each slot, the sum of the rain column (the number of 1s) and the size (the number of observations). The rainfall rate for a slot is then sum / size. Your code creates year, month, and abbreviated month-name columns, but that is not really necessary.

    import random

    import pandas as pd

    random.seed(20200817)

    # Hourly timestamps with a random 0/1 value for each hour
    date_rng = pd.date_range('2013-01-01', '2016-01-01', freq='h')
    rain = random.choices([0, 1], k=len(date_rng))
    b = pd.DataFrame({'date': date_rng, 'rain': rain})

    # Per (month, day, hour) slot: sum of the rain column and number of observations
    hour_rain = b.groupby([b.date.dt.month, b.date.dt.day, b.date.dt.hour])['rain'].agg(['sum', 'size'])
    hour_rain.index.names = ['month', 'day', 'hour']

    hour_rain.reset_index()
    
          month  day  hour  sum  size
    0         1    1     0    0     4
    1         1    1     1    2     3
    2         1    1     2    3     3
    3         1    1     3    1     3
    4         1    1     4    1     3
    ...     ...  ...   ...  ...   ...
    8755     12   31    19    2     3
    8756     12   31    20    2     3
    8757     12   31    21    2     3
    8758     12   31    22    0     3
    8759     12   31    23    0     3
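
The rate itself can be computed in the same groupby, since the mean of a 0/1 column equals sum / size. The sketch below is only an illustration under stated assumptions: the data is random rather than real rainfall, it is restricted to the 10 am to 3 pm window from the question, it groups by month and hour instead of month, day, and hour, and the names `keys`, `rate`, and `month_name` are hypothetical. The `rate` column is the share of 1s; if 1 means "no rain", as the question states, the chance of rain would be 1 - rate.

```python
import calendar

import numpy as np
import pandas as pd

# Hypothetical sample data: random 0/1 values, hourly, 2011-2013 (not real rainfall).
rng = np.random.default_rng(20200817)
dates = pd.date_range("2011-01-01", "2013-12-31 23:00", freq="60min")
b = pd.DataFrame({"date": dates, "rain": rng.integers(0, 2, size=len(dates))})

# Keep only the 10 am - 3 pm observations mentioned in the question.
b = b[b["date"].dt.hour.between(10, 15)]

# Group by month and hour; "mean" of a 0/1 column is the share of 1s (sum / size).
keys = [b["date"].dt.month.rename("month"), b["date"].dt.hour.rename("hour")]
rate = b.groupby(keys)["rain"].agg(["sum", "size", "mean"])
rate = rate.rename(columns={"mean": "rate"}).reset_index()

# Abbreviated month names via a lookup table instead of an if/elif chain.
rate["month_name"] = rate["month"].map(dict(enumerate(calendar.month_abbr)))

print(rate.head())
```

Grouping by month and hour pools about 90 observations per slot over three years, whereas grouping by month, day, and hour leaves only 3 per slot, so the coarser grouping should give steadier estimates. Which granularity is appropriate depends on the data.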