Tags: python, pandas, group-by

Adding one more column to a grouped dataframe


I currently have these two dataframes (before and after):

Before:

                 timestamp  cnt   t1   t2    hum  wind_speed  is_holiday        weather  year  month  day
0      2015-01-04 00:00:00  182  3.0  2.0   93.0         6.0           0  broken clouds  2015      1    4
1      2015-01-04 01:00:00  138  3.0  2.5   93.0         5.0           0          clear  2015      1    4
2      2015-01-04 02:00:00  134  2.5  2.5   96.5         0.0           0          clear  2015      1    4
3      2015-01-04 03:00:00   72  2.0  2.0  100.0         0.0           0          clear  2015      1    4
4      2015-01-04 04:00:00   47  2.0  0.0   93.0         6.5           0          clear  2015      1    4
...                    ...  ...  ...  ...    ...         ...         ...            ...   ...    ...  ...
17337  2016-12-31 19:00:00  493  9.0  6.0   82.0        17.0           0         cloudy  2016     12   31
17338  2016-12-31 20:00:00  465  8.5  6.5   84.5        13.0           0         cloudy  2016     12   31
17339  2016-12-31 21:00:00  474  8.0  6.0   87.0        12.0           0         cloudy  2016     12   31
17340  2016-12-31 22:00:00  587  8.0  6.0   90.0        12.0           0         cloudy  2016     12   31
17341  2016-12-31 23:00:00  697  8.0  5.0   87.0        18.5           0         cloudy  2016     12   31

After:

      timestamp    cnt
0    2015-01-04   9234
1    2015-01-05  20372
2    2015-01-06  20613
3    2015-01-07  21064
4    2015-01-08  15601
...         ...    ...
722  2016-12-27  10842
723  2016-12-28  12428
724  2016-12-29  14052
725  2016-12-30  11566
726  2016-12-31  11424

So the point is that I dropped all the unnecessary columns and summed the cnt values per day (this removes the hours from the dates and merges each day's hourly counts into a single row).

But there is also a column called 'is_holiday' that is 1 if the day is a holiday and 0 otherwise. The value is repeated in every hourly row of that day. I want my 2nd dataframe to have a 3rd column, 'is_holiday', holding that 1 or 0.

How do I build the 2nd dataframe from the 1st so that it also includes is_holiday? That would make it easier to find out whether 'cnt' (the number of bicycles rented that day) is higher on holidays or not.

My code:

import pandas as pd
import seaborn as sn
import matplotlib.pyplot as plt
from datetime import datetime
import numpy as np

path2 = r'C:\Users\bossd\OneDrive\Документы\pipirka.csv'
# Parse timestamps on load so that the .dt accessor works
df2 = pd.read_csv(path2, parse_dates=["timestamp"])
# Making 2nd dataframe: one row per day with the summed cnt
df2 = df2.groupby(df2["timestamp"].dt.date)["cnt"].sum().reset_index()
# timestamp now holds datetime.date objects, not strings, so filter by year
is_2015 = pd.to_datetime(df2["timestamp"]).dt.year == 2015
x = df2.loc[is_2015, 'timestamp']
y = df2.loc[is_2015, 'cnt']

Solution

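  • Use .groupby + .agg, summing cnt and taking the first is_holiday value of each day (the flag is the same for every hour of a day). This assumes timestamp has a datetime dtype: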

    df = (
        df.groupby(df["timestamp"].dt.date)
        .agg({"cnt": "sum", "is_holiday": "first"})
        .reset_index()
    )
    print(df)
    

    Prints (for the five sample rows shown in the question):

        timestamp  cnt  is_holiday
    0  2015-01-04  573           0
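
To get from the daily table to the comparison the question asks about, a second groupby on is_holiday is one option. The following is only a minimal sketch on made-up data: the four hourly rows, the names hourly, daily and mean_by_holiday, and the choice of "max" instead of "first" are assumptions for illustration, not part of the original answer. "max" marks a day as a holiday if any of its hours is flagged, which gives the same result as "first" when the flag is constant within a day.

```python
import pandas as pd

# Hypothetical hourly sample: two days, the second one marked as a holiday.
hourly = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            [
                "2015-01-04 00:00:00",
                "2015-01-04 01:00:00",
                "2015-01-05 00:00:00",
                "2015-01-05 01:00:00",
            ]
        ),
        "cnt": [182, 138, 100, 50],
        "is_holiday": [0, 0, 1, 1],
    }
)

# One row per calendar day: sum the rentals and keep the day's holiday flag.
# "max" yields 1 if any hour of the day is flagged as a holiday.
daily = (
    hourly.groupby(hourly["timestamp"].dt.date)
    .agg(cnt=("cnt", "sum"), is_holiday=("is_holiday", "max"))
    .reset_index()
)

# Average daily rentals on ordinary days (0) vs. holidays (1).
mean_by_holiday = daily.groupby("is_holiday")["cnt"].mean()

print(daily)
print(mean_by_holiday)
```

On the real data, the same two steps would apply to the dataframe loaded from the CSV, provided its timestamp column was parsed as datetime.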