I have the following dataframe:
test = [{'ID': 13562, 'Role':'AR','Location': 'London'},
{'ID': 13562, 'Role':'ST','Location': 'London'},
{'ID': 13562, 'Role':'MH','Location': 'London'},
{'ID': 89912, 'Role':'ST','Location': 'Madrid'},
{'ID': 89912, 'Role':'UL','Location': 'Madrid'},
{'ID': 15673, 'Role':'AR','Location': 'Berlin'},
{'ID': 34467, 'Role':'AR','Location': 'Berlin'},
{'ID': 34777, 'Role':'ST','Location': 'Berlin'},
{'ID': 34777, 'Role':'UL','Location': 'Berlin'}]
And the following list:
tlist = ['AR', 'AR_ST_MH','ST_UL','UL']
I am trying to create a new dataframe where I can group the IDs and roles as it is shown in the list. As you can see every single role in the dataframe is separated in different rows repeating each ID for each role (a single ID can have one or many roles). What is being challenging for me is putting together the different roles that each ID has to match the roles in the tlist and get a unique row for each match.
So far I tried this:
test['grouproles'] = test.groupby(['ID'])['Role'].transform(lambda x:'_'.join(x))
But I got this:
There are two problems with this output:
To explain myself a bit better, this is the result I am looking for:
New dataframe with the new roles matching all the options in the tlist and repeating the IDs only for the requested matches.
You can use set operations:
tlist = ['AR', 'AR_ST_MH','ST_UL','UL']
d = {frozenset(s.split('_')): s for s in tlist}
# {frozenset({'AR'}): 'AR',
# frozenset({'AR', 'MH', 'ST'}): 'AR_ST_MH',
# frozenset({'ST', 'UL'}): 'ST_UL',
# frozenset({'UL'}): 'UL'}
tmp = df.groupby('ID')['Role'].agg(frozenset)
out = (df[['ID', 'Location']]
.drop_duplicates()
.merge(tmp.apply(lambda s: [v for k, v in d.items() if k<=s]).explode(), on='ID')
)
Same logic with strings, if the order is conserved:
tmp = df.groupby('ID')['Role'].agg('_'.join)
out = (df[['ID', 'Location']]
.drop_duplicates()
.merge(tmp.apply(lambda s: [s2 for s2 in tlist if s2 in s]).explode(), on='ID')
)
Output:
ID Location Role
0 13562 London AR
1 13562 London AR_ST_MH
2 89912 Madrid ST_UL
3 89912 Madrid UL
4 15673 Berlin AR
5 34467 Berlin AR
6 34777 Berlin ST_UL
7 34777 Berlin UL
NB. the search is O(n²), you can however improve this by keeping a dictionary of valid sets for each item. For example, for AR_ST_MH
you don't have to search UL
.