Search code examples
pythonpandasstringliststring-concatenation

Concatenate strings from different rows in a dataframe matching the values of a list


I have the following dataframe:

test = [{'ID': 13562, 'Role':'AR','Location': 'London'},
    {'ID': 13562, 'Role':'ST','Location': 'London'},
    {'ID': 13562, 'Role':'MH','Location': 'London'},
    {'ID': 89912, 'Role':'ST','Location': 'Madrid'},
    {'ID': 89912, 'Role':'UL','Location': 'Madrid'},
    {'ID': 15673, 'Role':'AR','Location': 'Berlin'},
    {'ID': 34467, 'Role':'AR','Location': 'Berlin'},
    {'ID': 34777, 'Role':'ST','Location': 'Berlin'},
    {'ID': 34777, 'Role':'UL','Location': 'Berlin'}] 

And the following list:

tlist = ['AR', 'AR_ST_MH','ST_UL','UL']

I am trying to create a new dataframe where I can group the IDs and roles as it is shown in the list. As you can see every single role in the dataframe is separated in different rows repeating each ID for each role (a single ID can have one or many roles). What is being challenging for me is putting together the different roles that each ID has to match the roles in the tlist and get a unique row for each match.

So far I tried this:

test['grouproles'] = test.groupby(['ID'])['Role'].transform(lambda x:'_'.join(x))

But I got this:

.

There are two problems with this output:

  • I am getting repeated the roles. What I would really like is to keep only the rows that matches the values in the list.
  • The second problem is that it does not give me all the values I am asking for. For example, one of the values in the list is 'AR' only, as you see the ID 13562 has got the role AR but the output does not give me that option.

To explain myself a bit better, this is the result I am looking for:

..

New dataframe with the new roles matching all the options in the tlist and repeating the IDs only for the requested matches.


Solution

  • You can use set operations:

    tlist = ['AR', 'AR_ST_MH','ST_UL','UL']
    d = {frozenset(s.split('_')): s for s in tlist}
    # {frozenset({'AR'}): 'AR',
    #  frozenset({'AR', 'MH', 'ST'}): 'AR_ST_MH',
    #  frozenset({'ST', 'UL'}): 'ST_UL',
    #  frozenset({'UL'}): 'UL'}
    
    tmp = df.groupby('ID')['Role'].agg(frozenset)
    
    out = (df[['ID', 'Location']]
     .drop_duplicates()
     .merge(tmp.apply(lambda s: [v for k, v in d.items() if k<=s]).explode(), on='ID')
    )
    

    Same logic with strings, if the order is conserved:

    tmp = df.groupby('ID')['Role'].agg('_'.join)
    
    out = (df[['ID', 'Location']]
     .drop_duplicates()
     .merge(tmp.apply(lambda s: [s2 for s2 in tlist if s2 in s]).explode(), on='ID')
    )
    

    Output:

          ID Location      Role
    0  13562   London        AR
    1  13562   London  AR_ST_MH
    2  89912   Madrid     ST_UL
    3  89912   Madrid        UL
    4  15673   Berlin        AR
    5  34467   Berlin        AR
    6  34777   Berlin     ST_UL
    7  34777   Berlin        UL
    

    NB. the search is O(n²), you can however improve this by keeping a dictionary of valid sets for each item. For example, for AR_ST_MH you don't have to search UL.