python, pandas, group-by

How do I conditionally group rows of a dataframe?


In column 2 of df, there are three possible values: X, Y, Z. I want to group each X row together with any Y rows that directly follow it. I am not interested in preserving the Z rows in the groups.

I have tried using groupby() like this: df.groupby(df['2'] == 'X'), but this only splits the rows into two groups, X and non-X.

How could I go about creating the groupings that I am after?

import pandas as pd

df = pd.DataFrame({'1':['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p'],
                   '2':['Z','X','Y','Z','Z','X','X','Z','X','Y','Y','Z','X','Z','X','Y']})
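
To illustrate why the attempt described above falls short, here is a minimal sketch on the same sample data. Grouping by a boolean mask can produce at most two groups, one for False and one for True, so consecutive runs are never separated. The variable names `attempt` and `group_sizes` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'1': list('abcdefghijklmnop'),
                   '2': list('ZXYZZXXZXYYZXZXY')})

# The grouping key is a boolean Series, so there are only two possible keys.
attempt = df.groupby(df['2'] == 'X')

# Number of rows per key: False covers every Y and Z row, True covers every X row.
group_sizes = attempt.size()
print(group_sizes)
```

On this data the False group holds 10 rows and the True group holds 6, which is why a key that changes at each X is needed instead.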

Desired groupings:

1   b   X
2   c   Y
---------
5   f   X
---------
6   g   X
---------
8   i   X
9   j   Y
10  k   Y
---------
12  m   X
---------
14  o   X
15  p   Y

Solution

  • You can build the group key from a running count of the X rows, then drop the Z rows from that key:

    >>> list(df.groupby(df['2'].eq('X').cumsum().loc[df['2'] != 'Z']))
    [(1.0,
         1  2
      1  b  X
      2  c  Y),
     (2.0,
         1  2
      5  f  X),
     (3.0,
         1  2
      6  g  X),
     (4.0,
          1  2
      8   i  X
      9   j  Y
      10  k  Y),
     (5.0,
          1  2
      12  m  X),
     (6.0,
          1  2
      14  o  X
      15  p  Y)]
    

    Details:

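    # The running count increments at every X, so the Y rows after an X share its number.
    # Z rows are left without a key; groupby aligns the key on the index and skips them.
    # Equivalent as a grouping key: grp = df['2'].eq('X').cumsum().where(df['2'] != 'Z')
    grp = df['2'].eq('X').cumsum().loc[df['2'] != 'Z']
    pd.concat([df, grp.rename('G')], axis=1)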
    
    # Output
        1  2    G
    0   a  Z  NaN
    1   b  X  1.0
    2   c  Y  1.0
    3   d  Z  NaN
    4   e  Z  NaN
    5   f  X  2.0
    6   g  X  3.0
    7   h  Z  NaN
    8   i  X  4.0
    9   j  Y  4.0
    10  k  Y  4.0
    11  l  Z  NaN
    12  m  X  5.0
    13  n  Z  NaN
    14  o  X  6.0
    15  p  Y  6.0
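
The key above can be wrapped in a small helper and checked against the desired groupings. The sketch below is illustrative rather than definitive: the function names `group_key`, `strict_group_key` and `index_groups` are hypothetical, not part of pandas. Note that the cumulative count alone would also attach a Y to an earlier X when a Z sits between them (X, Z, Y), and would give key 0 to Y rows that appear before the first X. Neither case occurs in the sample data. Assuming such Y rows should be dropped, `strict_group_key` shows one possible way to handle them.

```python
import pandas as pd

def group_key(labels: pd.Series) -> pd.Series:
    """Key from the answer: running count of X rows, NaN on Z rows."""
    return labels.eq('X').cumsum().where(labels.ne('Z'))

def strict_group_key(labels: pd.Series) -> pd.Series:
    """Hypothetical stricter variant: a Y joins an X only if no Z separates them."""
    is_x = labels.eq('X')
    # A new block starts at every X and at every Z.
    block = (is_x | labels.eq('Z')).cumsum()
    # Only a block that starts with an X contains one, so keep exactly those blocks.
    has_x = is_x.groupby(block).transform('any')
    return block.where(has_x)

def index_groups(labels: pd.Series, key: pd.Series) -> list:
    """Row labels of each group; rows with a NaN key are skipped by groupby."""
    return [g.index.tolist() for _, g in labels.groupby(key)]

df = pd.DataFrame({'1': list('abcdefghijklmnop'),
                   '2': list('ZXYZZXXZXYYZXZXY')})

sample_groups = index_groups(df['2'], group_key(df['2']))
print(sample_groups)

# Edge case not present in the sample: a Y separated from its X by a Z.
edge = pd.Series(list('XZY'))
loose_edge = index_groups(edge, group_key(edge))
strict_edge = index_groups(edge, strict_group_key(edge))
print(loose_edge, strict_edge)
```

On the sample data both keys give the same six groups. They differ only on the edge case, where the original key groups rows 0 and 2 together and the stricter variant keeps row 0 alone.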