Search code examples
pythonpandasgroup-bydata-wrangling

Sequence of shifts in pandas


I am working with python and pandas with the following table/dataframe, containing information about work shifts. The cycle type defines the working/non-working days (e.g., 5-3 means 5 days working, 3 not working). The cycle day indicates in which day of the cycle the worker is (for a 5-3 cycle, this will go from 1 to 5). Finally, the shift indicates whether the worker has been in a morning (M), afternoon (A) or night (N) shift:

Worker Cycle_type Cycle_day Shift
Alice 5-3 1 M
Alice 5-3 2 M
Alice 5-3 3 A
Alice 5-3 4 A
Alice 5-3 5 N
Bob 6-2 1 N
Bob 6-2 2 M
Bob 6-2 3 M
Bob 6-2 4 N
Bob 6-2 5 A
Bob 6-2 6 M

The idea is to obtain a new column which indicates the sequence of shifts up the moment of the row. For example, Alice works 2 mornings, then 2 afternoons and then one night, so the sequence the last day would be "M1,M2,A1,A2,N1", where the numbers indicate the numer of iterations of each shift. In the case of Bob, the full sequence for the last day would be "N1,M1,M2,N2,A1,M3". The desired table would be te following:

Worker Cycle_type Cycle_day Shift Sequence
Alice 5-3 1 M M1
Alice 5-3 2 M M1,M2
Alice 5-3 3 A M1,M2,A1
Alice 5-3 4 A M1,M2,A1,A2
Alice 5-3 5 N M1,M2,A1,A2,N1
Bob 6-2 1 N N1
Bob 6-2 2 M N1,M1
Bob 6-2 3 M N1,M1,M2
Bob 6-2 4 N N1,M1,M2,N2
Bob 6-2 5 A N1,M1,M2,N2,A1
Bob 6-2 6 M N1,M1,M2,N2,A1,M3

This sequence must be computed for each day, not knowing about the future, this is, the third day of Alice has a sequence "M1,M2,A1", as it doesn't know about the fourth and fifth day yet. I am unable to think of a way, using python and pandas, to capture all these details. Any ideas?

EDIT: It would also be nice to have the full sequence as a new column, "Full_sequence", where it is shown regardless of the day, if possible


Solution

  • This is an older question, so the answer might be moot.

    Unfortunately GroupBy.cumsum (as well as GroupBy.transform("cumsum")) doesn't like strings (and lists), so one way to approach this would be to use accumulate from itertools:

    from itertools import accumulate
    
    def cumjoin(ser): return accumulate(ser, lambda acc, s: f"{acc},{s}")
    
    cycle = df.groupby("Worker")["Cycle_day"].diff().fillna(-1.0).lt(0.0).cumsum()
    df["Sequence"] = (
        (df["Shift"] + (df.groupby([cycle, "Shift"]).cumcount() + 1).astype("str"))
        .groupby(cycle).transform(cumjoin)
    )
    df["Full_sequence"] = df["Sequence"].groupby(cycle).transform("last")
    

    Result (df your dataframe):

       Worker Cycle_type  Cycle_day Shift           Sequence      Full_sequence
    0   Alice        5-3          1     M                 M1     M1,M2,A1,A2,N1
    1   Alice        5-3          2     M              M1,M2     M1,M2,A1,A2,N1
    2   Alice        5-3          3     A           M1,M2,A1     M1,M2,A1,A2,N1
    3   Alice        5-3          4     A        M1,M2,A1,A2     M1,M2,A1,A2,N1
    4   Alice        5-3          5     N     M1,M2,A1,A2,N1     M1,M2,A1,A2,N1
    5     Bob        6-2          1     N                 N1  N1,M1,M2,N2,A1,M3
    6     Bob        6-2          2     M              N1,M1  N1,M1,M2,N2,A1,M3
    7     Bob        6-2          3     M           N1,M1,M2  N1,M1,M2,N2,A1,M3
    8     Bob        6-2          4     N        N1,M1,M2,N2  N1,M1,M2,N2,A1,M3
    9     Bob        6-2          5     A     N1,M1,M2,N2,A1  N1,M1,M2,N2,A1,M3
    10    Bob        6-2          6     M  N1,M1,M2,N2,A1,M3  N1,M1,M2,N2,A1,M3
    

    There's a way to avoid accumulate, but it is a bit convoluted:

    cycle = df.groupby("Worker")["Cycle_day"].diff().fillna(-1.0).lt(0.0).cumsum()
    seq = df["Shift"] + (df.groupby([cycle, "Shift"]).cumcount() + 1).astype("str")
    
    grouper = seq.groupby(cycle)
    idx, idx_count = seq.index, grouper.transform("size") - grouper.cumcount()
    seq_df = pd.DataFrame({"Sequence": seq, "Cycle": cycle}).loc[idx.repeat(idx_count), :]
    
    grouper = seq_df.groupby(["Cycle", "Sequence"])
    groups = grouper.transform("size") - grouper.cumcount()
    df["Sequence"] = (
        seq_df.groupby(["Cycle", groups], sort=False).agg({"Sequence": ",".join})
        .reset_index(drop=True)
    )