Suppose I have this pandas dataset:
ID | Question Code |
---|---|
1 | Q01 |
1 | Q01-1 |
1 | Q02 |
2 | Q01 |
2 | Q02 |
2 | Q02-1 |
2 | Q02-1-1 |
2 | Q02-2 |
I want to remove the rows based on certain hierarchical conditions between the values of question codes per ID. For example, Q01-1 is a sub-question when Q01 is answered, soI don't need to keep Q01 anymore since we already have Q01-1. By ID 2, I need to show Q01, Q02-1-1 (since it is a sub-question of Q02-1, which is also one of Q02) and Q02-2 (since it is also another sub-question of Q02).
The desired final result of the table above would be:
ID | Question Code |
---|---|
1 | Q01-1 |
1 | Q02 |
2 | Q01 |
2 | Q02-1-1 |
2 | Q02-2 |
Thanks in advance for the help!
You could extract
the part before the trailing -xxx
and use this to identify the levels to drop with boolean indexing. Perform per group with groupby.transform
:
out = df[df.groupby('ID')['Question Code']
.transform(lambda x: ~x.isin(x.str.extract('(.*)-[^-]+$',
expand=False).dropna()))]
Output:
ID Question Code
1 1 Q01-1
2 1 Q02
3 2 Q01
6 2 Q02-1-1
7 2 Q02-2
Example intermediates for ID 2:
# s.str.extract('(.*)-[^-]+$', expand=False).dropna()
5 Q02
6 Q02-1
7 Q02
Name: Question Code, dtype: object
# ~s.isin(s.str.extract('(.*)-[^-]+$', expand=False).dropna())
3 True
4 False
5 False
6 True
7 True
Name: Question Code, dtype: bool