Search code examples
pythondataframelist-comprehensiondata-filtering

How can I filter this pandas dataframe based with substrings and row counts stored in another dataframe?


I have a dataframe that contains groups of people's names in a column as a string. Then I also have a dataframe that contains a list of people's names (1 dataframe row per person's name). Sample dataset as described:

    groupsofpeople = pd.DataFrame({'group':['Genie, Chris, Mike, Dee','Chase, Chris, 
        William S.','Mike, Tim', 'William S., Chris, Tim'], 'rank':[1,2,3,4]})
    people = pd.DataFrame({'name':['Chris','Chase','Ross Rick', 'William S.'], 
        'rowcnt':[30,25,15,25]})

                          group  rank
    0   Genie, Chris, Mike, Dee     1
    1  Chase, Chris, William S.     2
    2                 Mike, Tim     3
    3    William S., Chris, Tim     4

    >>> people
             name  rowcnt
    0       Chris      2
    1       Chase      3
    2   Ross Rick      1
    3  William S.      2

My goal is to iterate through the 'people' dataframe names and locate rows in groupsofpeople if 1 of the names in the group match. The extra caveat to this is that I also want to filter each row to only return the number of people in the rowcnt column. So if groupsofpeople dataframe has 5 rows with Chris in it, this statement before the filter is applied, then after the substring is found, only the first 2 rows are returned (as the groupsofpeople area in a specific order).

I dont have a full grasp on list comprehension to pull this off on my own, but I can tell you that this hardcoded solution would give the result of what I am striving towards if there is a way to loop through the names (in the str.contains()) as well as the rowcnt (in the head()):

    founddf = groupsofpeople[groupsofpeople['group'].str.contains('Chris')].head(2)
    upload = pd.concat([upload,founddf],ignore_index=False, sort=False)
    founddf = groupsofpeople[groupsofpeople['group'].str.contains('Chase')].head(3)
    upload = pd.concat([upload,founddf],ignore_index=False, sort=False)
    founddf = groupsofpeople[groupsofpeople['group'].str.contains('Ross Rick')].head(1)
    upload = pd.concat([upload,founddf],ignore_index=False, sort=False)

This is the result I would like to get to without hardcoding the loop/iterations):

          group                  rank
       Chase, Chris, William S.     2
       Genie, Chris, Mike, Dee      1
       Chase, Chris, William S.     2

Can anyone help me achieve this solution?


Solution

  • Here is a way to output the first n rows per person according to the 'rowcnt' value in the people dataframe

    names = set(people.name)
    people2 = people.set_index('name')
    
    (groupsofpeople
     .assign(group_list=groupsofpeople['group'].str.split(', ').apply(lambda x: set(x).intersection(names)))
     .explode('group_list')
     .dropna(subset=['group_list'])
     .groupby(['group_list'])
     .apply(lambda d: d.head(people2.loc[d.name, 'rowcnt']))
     .reset_index(drop=True)
     #.drop('group_list', axis=1) #kept here to see groups
    )
    

    output:

                          group  rank  group_list
    0  Chase, Chris, William S.     2       Chase
    1   Genie, Chris, Mike, Dee     1       Chris
    2  Chase, Chris, William S.     2       Chris
    3  Chase, Chris, William S.     2  William S.
    4    William S., Chris, Tim     4  William S.