Search code examples
pythonpandasdataframehierarchical-datadata-transform

Dropping grouped rows based on a certain hierarchical column


Suppose I have this pandas dataset:

ID Question Code
1 Q01
1 Q01-1
1 Q02
2 Q01
2 Q02
2 Q02-1
2 Q02-1-1
2 Q02-2

I want to remove the rows based on certain hierarchical conditions between the values of question codes per ID. For example, Q01-1 is a sub-question when Q01 is answered, soI don't need to keep Q01 anymore since we already have Q01-1. By ID 2, I need to show Q01, Q02-1-1 (since it is a sub-question of Q02-1, which is also one of Q02) and Q02-2 (since it is also another sub-question of Q02).

The desired final result of the table above would be:

ID Question Code
1 Q01-1
1 Q02
2 Q01
2 Q02-1-1
2 Q02-2

Thanks in advance for the help!


Solution

  • You could extract the part before the trailing -xxx and use this to identify the levels to drop with boolean indexing. Perform per group with groupby.transform:

    out = df[df.groupby('ID')['Question Code']
               .transform(lambda x: ~x.isin(x.str.extract('(.*)-[^-]+$',
                                            expand=False).dropna()))]
    

    Output:

       ID Question Code
    1   1         Q01-1
    2   1           Q02
    3   2           Q01
    6   2       Q02-1-1
    7   2         Q02-2
    

    Example intermediates for ID 2:

    # s.str.extract('(.*)-[^-]+$', expand=False).dropna()
    5      Q02
    6    Q02-1
    7      Q02
    Name: Question Code, dtype: object
    
    # ~s.isin(s.str.extract('(.*)-[^-]+$', expand=False).dropna())
    3     True
    4    False
    5    False
    6     True
    7     True
    Name: Question Code, dtype: bool