
Grow pandas dataframe group by group


I have a multi-index pandas DataFrame. In my example there are two index levels: vehicle (with values A and B) and reference_day (with values 1 and 2). For each vehicle and each day there is a set of moments in time, given as strings in HHMM format without leading zeros, so that e.g. '2330' corresponds to 11:30 pm and '30' to 0:30 am. These moments are ordered chronologically, but within one reference_day they may cross the "midnight" line. That is, a moment at 2:00 am may still be counted under the PREVIOUS day. I want a new column that takes the value 1 if the time in that row actually belongs to a "new" day (i.e. the midnight line has been crossed) and 0 otherwise. The example corresponds to a train timetable in which trips between midnight and (approximately) 4 am are registered under the preceding day.

Example:

import pandas as pd

# Named "data" rather than "dict" to avoid shadowing the built-in
data = {"vehicle": ["A"]*8 + ["B"]*8,
        "reference_day" : [1, 1, 1, 1, 2, 2, 2, 2]*2,
        "time" : [1830, 2200, 30, 115, 1700, 1800, 2300, 100,
                  1900, 2300, 15, 200, 1500, 2000, 2330, 120]}
df = pd.DataFrame(data).set_index(["vehicle", "reference_day"])

DataFrame looks like this:

                       time
vehicle reference_day      
A       1              1830
        1              2200
        1                30
        1               115
        2              1700
        2              1800
        2              2300
        2               100
B       1              1900
        1              2300
        1                15
        1               200
        2              1500
        2              2000
        2              2330
        2               120

I want to have an extra column like this:

                       time  next_day
vehicle reference_day                
A       1              1830   0
        1              2200   0
        1                30   1
        1               115   1
        2              1700   0
        2              1800   0
        2              2300   0
        2               100   1
B       1              1900   0
        1              2300   0
        1                15   1
        1               200   1
        2              1500   0
        2              2000   0
        2              2330   0
        2               120   1

How can I achieve this in an elegant way? I hope someone can help, thanks!


Solution

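  • Let's try the following. Within each (vehicle, reference_day) group, a negative difference between consecutive times marks the midnight crossing, and the cumulative sum carries that flag over to every later row of the group:

    df['next_day'] = (df.groupby(level=[0, 1])['time']
                        .transform(lambda x: x.diff().lt(0).cumsum()))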
    

    Output:

                           time  next_day
    vehicle reference_day                
    A       1              1830         0
            1              2200         0
            1                30         1
            1               115         1
            2              1700         0
            2              1800         0
            2              2300         0
            2               100         1
    B       1              1900         0
            1              2300         0
            1                15         1
            1               200         1
            2              1500         0
            2              2000         0
            2              2330         0
            2               120         1
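
The question describes the times as strings, while the example data holds integers. If the column really contains strings, `diff()` has nothing numeric to subtract, so a conversion step is needed first. Below is a minimal, self-contained sketch of the same approach under that assumption. The names `hhmm` and `crossings` and the final `clip` are illustrative additions, not part of the original answer.

```python
import pandas as pd

# Same data as in the question, but with times stored as strings
# (assumption: this matches the "string-format" the question mentions).
data = {
    "vehicle": ["A"] * 8 + ["B"] * 8,
    "reference_day": [1, 1, 1, 1, 2, 2, 2, 2] * 2,
    "time": ["1830", "2200", "30", "115", "1700", "1800", "2300", "100",
             "1900", "2300", "15", "200", "1500", "2000", "2330", "120"],
}
df = pd.DataFrame(data).set_index(["vehicle", "reference_day"])

# Convert the HHMM strings to integers so consecutive rows can be compared.
hhmm = df["time"].astype(int)

# Within each (vehicle, reference_day) group, a drop in time marks the
# midnight crossing; cumsum() propagates the flag to the rows after it.
crossings = hhmm.groupby(level=["vehicle", "reference_day"]).transform(
    lambda s: s.diff().lt(0).cumsum()
)

# Hypothetical safeguard: cap at 1 in case a group ever contains
# more than one drop.
df["next_day"] = crossings.clip(upper=1).astype(int)
print(df)
```

With chronologically ordered times and at most one midnight crossing per reference_day, the `clip` changes nothing; it only keeps the column binary if the data ever violates that assumption.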