Tags: python, python-3.x, pandas, dataframe

How to group rows based on column ID in a pandas dataframe?


I have the dataframe below, df1:

ID       Label   Value
id_1     A
id_1     B
id_1     C
id_1     D
id_1     E
id_1             10
id_1             20
id_1             30
id_2     F
id_2     G
id_2     H
id_2             40
id_2             50
id_2             60
id_2             70
id_2             80
id_2             90

I would like to group the rows based on the ID column in the following way:

ID      Label   Value
id_1     A      10
id_1     B      20
id_1     C      30
id_1     D
id_1     E      
id_2     F      40
id_2     G      50
id_2     H      60
id_2            70
id_2            80
id_2            90

My goal is to always align the first value in column "Label" for a given ID with the first value of column "Value" for the same ID (the empty cells in between numbers are expected).

How can I do this in the most efficient way?

I tried the groupby feature but didn't manage to get what I want. I'm pretty sure there is an efficient way to do this, but I can't figure it out right now.


Solution

  • Assuming empty cells are NaN/None, you could count the number of leading empty rows in "Value" and the number of trailing empty rows in "Label" (with isna + cummin + sum), then use groupby.apply to shift "Value" up by the first count and drop the now-empty rows at the end of each group with head:

    def cust_shift(g):
        # number of leading empty rows
        n1 = g['Value'].isna().cummin().sum()
        # number of trailing empty rows
        n2 = g.loc[::-1, 'Label'].isna().cummin().sum()
        # shift Value up and remove trailing empty rows
        return g.assign(Value=g['Value'].shift(-n1)).head(-min(n1, n2))
    
    out = df.groupby('ID', group_keys=False)[list(df)].apply(cust_shift)
    

    Output:

          ID Label  Value
    0   id_1     A   10.0
    1   id_1     B   20.0
    2   id_1     C   30.0
    3   id_1     D    NaN
    4   id_1     E    NaN
    8   id_2     F   40.0
    9   id_2     G   50.0
    10  id_2     H   60.0
    11  id_2  None   70.0
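
To see why isna + cummin + sum counts only the leading run of empty cells (or the trailing run, when applied to the reversed Series), here is a small standalone sketch. The Series `s` is made-up illustration data, not part of the question.

```python
import numpy as np
import pandas as pd

# Illustration data: two leading NaN, one NaN in the middle, one trailing NaN
s = pd.Series([np.nan, np.nan, 10.0, np.nan, 20.0, np.nan])

# isna marks every empty cell, including the one in the middle
mask = s.isna()

# cummin turns everything after the first False into False,
# so only the leading run of True values survives
leading = mask.cummin()
n_leading = int(leading.sum())

# the same trick on the reversed Series counts the trailing run
trailing = s[::-1].isna().cummin()
n_trailing = int(trailing.sum())

print(n_leading, n_trailing)
```

The NaN in the middle of `s` is not counted in either direction, which is what keeps gaps between numbers intact when the column is shifted.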
    

    Reproducible input:

    import pandas as pd
    from numpy import nan
    df = pd.DataFrame({'ID': ['id_1', 'id_1', 'id_1', 'id_1', 'id_1', 'id_1', 'id_1', 'id_1',
                              'id_2', 'id_2', 'id_2', 'id_2', 'id_2', 'id_2', 'id_2'],
                       'Label': ['A', 'B', 'C', 'D', 'E', None, None, None, 'F', 'G', 'H', None, None, None, None],
                       'Value': [nan, nan, nan, nan, nan, 10.0, 20.0, 30.0, nan, nan, nan, 40.0, 50.0, 60.0, 70.0]})
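
One edge case that the sample data does not exercise: when a group has no leading empty "Value" rows or no trailing empty "Label" rows, min(n1, n2) is 0, and head(-0) is the same as head(0), which returns an empty frame, so the whole group would be dropped. A possible guard, sketched below under the same NaN/None assumption, is to slice with iloc on an explicit length. The name `cust_shift_safe` and the small frame are hypothetical, and the groups are processed with a plain loop plus pd.concat rather than groupby.apply.

```python
import numpy as np
import pandas as pd

def cust_shift_safe(g):
    # number of leading empty rows in Value
    n1 = int(g['Value'].isna().cummin().sum())
    # number of trailing empty rows in Label
    n2 = int(g['Label'][::-1].isna().cummin().sum())
    # shift Value up, then keep all rows except the last min(n1, n2);
    # an explicit stop position keeps every row when that minimum is 0
    shifted = g.assign(Value=g['Value'].shift(-n1))
    return shifted.iloc[:len(g) - min(n1, n2)]

# Hypothetical data: id_1 needs shifting, id_2 is already aligned (n1 == 0)
df = pd.DataFrame({
    'ID': ['id_1', 'id_1', 'id_1', 'id_1', 'id_2', 'id_2', 'id_2'],
    'Label': ['A', 'B', None, None, 'C', 'D', None],
    'Value': [np.nan, np.nan, 1.0, 2.0, 3.0, 4.0, 5.0],
})

out = pd.concat([cust_shift_safe(g) for _, g in df.groupby('ID', sort=False)])
print(out)
```

With this variant the id_2 rows are kept unchanged instead of being dropped.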
    

    Empty cells are empty strings

    If your empty cells are in fact empty strings, just adapt the above code to use eq('') in place of isna and add fill_value='' in shift:

    def cust_shift(g):
        # number of leading empty rows
        n1 = g['Value'].eq('').cummin().sum()
        # number of trailing empty rows
        n2 = g.loc[::-1, 'Label'].eq('').cummin().sum()
        return (g.assign(Value=g['Value'].shift(-n1, fill_value=''))
                 .head(-min(n1, n2))
                )
    
    out = (df.groupby('ID', group_keys=False)[list(df)]
             .apply(cust_shift)
          )
    

    Output:

          ID Label Value
    0   id_1     A    10
    1   id_1     B    20
    2   id_1     C    30
    3   id_1     D      
    4   id_1     E      
    8   id_2     F    40
    9   id_2     G    50
    10  id_2     H    60
    11  id_2          70
    

    Alternative input:

    df = pd.DataFrame({'ID': ['id_1', 'id_1', 'id_1', 'id_1', 'id_1', 'id_1', 'id_1', 'id_1',
                              'id_2', 'id_2', 'id_2', 'id_2', 'id_2', 'id_2', 'id_2'],
                       'Label': ['A', 'B', 'C', 'D', 'E', '', '', '', 'F', 'G', 'H', '', '', '', ''],
                       'Value': ['', '', '', '', '', 10, 20, 30, '', '', '', 40, 50, 60, 70]})
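
Another option, not part of the answer above, is to convert the empty strings to NaN first and then reuse the NaN-based logic, which also leaves "Value" with a numeric dtype. The sketch below assumes "Value" holds only numbers and empty strings. The helper name `shift_group` is hypothetical, and it slices with iloc instead of head so that a count of 0 keeps every row.

```python
import pandas as pd

df = pd.DataFrame({'ID': ['id_1', 'id_1', 'id_1', 'id_1', 'id_1', 'id_1', 'id_1', 'id_1',
                          'id_2', 'id_2', 'id_2', 'id_2', 'id_2', 'id_2', 'id_2'],
                   'Label': ['A', 'B', 'C', 'D', 'E', '', '', '', 'F', 'G', 'H', '', '', '', ''],
                   'Value': ['', '', '', '', '', 10, 20, 30, '', '', '', 40, 50, 60, 70]})

# Normalize: empty strings become NaN in both columns
clean = df.assign(
    Label=df['Label'].mask(df['Label'].eq('')),
    Value=pd.to_numeric(df['Value'], errors='coerce'),
)

def shift_group(g):
    # leading empty rows in Value, trailing empty rows in Label
    n1 = int(g['Value'].isna().cummin().sum())
    n2 = int(g['Label'][::-1].isna().cummin().sum())
    # shift Value up and drop the last min(n1, n2) rows
    return g.assign(Value=g['Value'].shift(-n1)).iloc[:len(g) - min(n1, n2)]

out = pd.concat([shift_group(g) for _, g in clean.groupby('ID', sort=False)])
print(out)
```

If the empty strings are needed in the final result, they could be restored afterwards with fillna('') on the relevant columns, at the cost of turning "Value" back into an object column.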