Search code examples
pythonpandasdata-analysispandas-resample

Pandas resample method does not work properly


I have been analyzing the Seoul Bike Sharing Demand dataset, which is available at Seoul Bike Sharing Demand . During my analysis, I found the need to use a resampling method. To accomplish this, I loaded the dataset into a Pandas DataFrame, which I named df. Then, I applied the desired resampling method using the following code snippet:

label_encoders = {}
categorical_columns = list()
for column in df.columns:
    if df.dtypes[column] in [np.int64, np.float64]:
        pass
    else:
        if column != 'Date':
            categorical_columns.append(column)
            Label_encoder = LabelEncoder()
            label_encoders[column] = Label_encoder
            
            numerical_column = Label_encoder.fit_transform(df[column])
            df[column] = numerical_column
ndf = df.copy()
ndf.set_index('Date', inplace = True)

I want to draw a correlation between Holidays and Rented Bike count.

Holiday = ndf[ndf['Holiday'] == 0].resample('D')['Rented Bike Count'].sum()

But I expected that it does not include days that are not holiday which in this case , their Holiday column should not be 1(0 -> Holiday, 1 -> Not Holiday) . But when I run this code, the result is something like this:

Date
2017-12-22    7184
2017-12-23       0
2017-12-24    2014
2017-12-25    3966
2017-12-26       0
              ... 
2018-10-05       0
2018-10-06       0
2018-10-07       0
2018-10-08       0
2018-10-09       0
Freq: D, Name: Rented Bike Count, Length: 292, dtype: int64

For example 2017-12-23 is not Holiday but is included in the result and other days like
2018-10-05. But if I run the code with a minor change for non-holiday days it seems to work fine

Holiday = ndf[ndf['Holiday'] == 1].resample('D')['Rented Bike Count'].sum()

And it as expected does not included the holiday days in the processing . I mean why this problem pops up? What am I doing wrong in this case? Thanks in advance


Solution

  • I mean why this problem pops up? What am I doing wrong in this case?

    It's not a problem and you're not doing anything wrong :-)

    When you resample a time series, you aggregate values for a same time slice. In your case, you convert an hourly TS to a daily TS and compute the sum of Rented Bike Count. As you filter your dataframe to keep only holidays, you expected to have only 18 days (432 records where holidays=0 / 24 hours)

    You're right but remember, you use resample so Pandas will return a DatetimeIndex with a freq=D from the first date (first holiday) to the last date (last holiday). So in your case all days between 2017-12-22 and 2018-09-25 (278 days):

    m = ndf['Holiday'] == 0
    start = ndf[m].index.min()
    end = ndf[m].index.max()
    
    # Output
    >>> m.sum() / 24
    18.0
    
    >>> start
    Timestamp('2017-12-22 00:00:00')  # the first holiday
    
    >>> end
    Timestamp('2018-09-25 00:00:00')  # the last holiday
    
    >>> end - start + pd.Timedelta(days=1)  # not inclusive, add 1 day
    Timedelta('278 days 00:00:00')
    

    The output of your code is:

    >>> ndf[ndf['Holiday'] == 0].resample('D')['Rented Bike Count'].sum()
    Date
    2017-12-22     7184  # first holiday
    2017-12-23        0
    2017-12-24     2014
    2017-12-25     3966
    2017-12-26        0
                  ...  
    2018-09-21        0
    2018-09-22        0
    2018-09-23    20060
    2018-09-24    17259
    2018-09-25    23350  # last holiday
    Freq: D, Name: Rented Bike Count, Length: 278, dtype: int64
    #     ^-- daily freq                       ^-- number of days
    

    To get only holidays, you have to use the boolean mask m:

    >>> ndf[ndf['Holiday'] == 0].resample('D')['Rented Bike Count'].sum()[m[m].index.unique()]
    
    Date
    2017-12-22     7184
    2017-12-24     2014
    2017-12-25     3966
    2017-12-31     3423
    2018-01-01     4290
    2018-02-15     3484
    2018-02-16     2593
    2018-02-17     3118
    2018-01-03     5132
    2018-01-05    26820
    2018-05-22    13790
    2018-06-06    30498
    2018-08-15    18565
    2018-09-23    20060
    2018-09-24    17259
    2018-09-25    23350
    2018-03-10    30349
    2018-09-10        0
    Name: Rented Bike Count, dtype: int64
    

    Note: you should sort your dataframe by index (Date) to better understanding.