I have been analyzing the Seoul Bike Sharing Demand dataset, which is available at Seoul Bike Sharing Demand . During my analysis, I found the need to use a resampling method. To accomplish this, I loaded the dataset into a Pandas DataFrame, which I named df. Then, I applied the desired resampling method using the following code snippet:
label_encoders = {}
categorical_columns = list()
for column in df.columns:
if df.dtypes[column] in [np.int64, np.float64]:
pass
else:
if column != 'Date':
categorical_columns.append(column)
Label_encoder = LabelEncoder()
label_encoders[column] = Label_encoder
numerical_column = Label_encoder.fit_transform(df[column])
df[column] = numerical_column
ndf = df.copy()
ndf.set_index('Date', inplace = True)
I want to draw a correlation between Holidays and Rented Bike count.
Holiday = ndf[ndf['Holiday'] == 0].resample('D')['Rented Bike Count'].sum()
But I expected that it does not include days that are not holiday which in this case , their Holiday column should not be 1(0 -> Holiday, 1 -> Not Holiday) . But when I run this code, the result is something like this:
Date
2017-12-22 7184
2017-12-23 0
2017-12-24 2014
2017-12-25 3966
2017-12-26 0
...
2018-10-05 0
2018-10-06 0
2018-10-07 0
2018-10-08 0
2018-10-09 0
Freq: D, Name: Rented Bike Count, Length: 292, dtype: int64
For example 2017-12-23 is not Holiday but is included in the result and other days like
2018-10-05. But if I run the code with a minor change for non-holiday days it seems to work fine
Holiday = ndf[ndf['Holiday'] == 1].resample('D')['Rented Bike Count'].sum()
And it as expected does not included the holiday days in the processing . I mean why this problem pops up? What am I doing wrong in this case? Thanks in advance
I mean why this problem pops up? What am I doing wrong in this case?
It's not a problem and you're not doing anything wrong :-)
When you resample a time series, you aggregate values for a same time slice. In your case, you convert an hourly TS to a daily TS and compute the sum of Rented Bike Count
. As you filter your dataframe to keep only holidays, you expected to have only 18 days (432 records where holidays=0 / 24 hours)
You're right but remember, you use resample
so Pandas will return a DatetimeIndex
with a freq=D
from the first date (first holiday) to the last date (last holiday). So in your case all days between 2017-12-22 and 2018-09-25 (278 days):
m = ndf['Holiday'] == 0
start = ndf[m].index.min()
end = ndf[m].index.max()
# Output
>>> m.sum() / 24
18.0
>>> start
Timestamp('2017-12-22 00:00:00') # the first holiday
>>> end
Timestamp('2018-09-25 00:00:00') # the last holiday
>>> end - start + pd.Timedelta(days=1) # not inclusive, add 1 day
Timedelta('278 days 00:00:00')
The output of your code is:
>>> ndf[ndf['Holiday'] == 0].resample('D')['Rented Bike Count'].sum()
Date
2017-12-22 7184 # first holiday
2017-12-23 0
2017-12-24 2014
2017-12-25 3966
2017-12-26 0
...
2018-09-21 0
2018-09-22 0
2018-09-23 20060
2018-09-24 17259
2018-09-25 23350 # last holiday
Freq: D, Name: Rented Bike Count, Length: 278, dtype: int64
# ^-- daily freq ^-- number of days
To get only holidays, you have to use the boolean mask m
:
>>> ndf[ndf['Holiday'] == 0].resample('D')['Rented Bike Count'].sum()[m[m].index.unique()]
Date
2017-12-22 7184
2017-12-24 2014
2017-12-25 3966
2017-12-31 3423
2018-01-01 4290
2018-02-15 3484
2018-02-16 2593
2018-02-17 3118
2018-01-03 5132
2018-01-05 26820
2018-05-22 13790
2018-06-06 30498
2018-08-15 18565
2018-09-23 20060
2018-09-24 17259
2018-09-25 23350
2018-03-10 30349
2018-09-10 0
Name: Rented Bike Count, dtype: int64
Note: you should sort your dataframe by index (Date) to better understanding.