I have a dataframe that contains groups of people's names in a column as a string. Then I also have a dataframe that contains a list of people's names (1 dataframe row per person's name). Sample dataset as described:
groupsofpeople = pd.DataFrame({'group':['Genie, Chris, Mike, Dee','Chase, Chris,
William S.','Mike, Tim', 'William S., Chris, Tim'], 'rank':[1,2,3,4]})
people = pd.DataFrame({'name':['Chris','Chase','Ross Rick', 'William S.'],
'rowcnt':[30,25,15,25]})
group rank
0 Genie, Chris, Mike, Dee 1
1 Chase, Chris, William S. 2
2 Mike, Tim 3
3 William S., Chris, Tim 4
>>> people
name rowcnt
0 Chris 2
1 Chase 3
2 Ross Rick 1
3 William S. 2
My goal is to iterate through the 'people' dataframe names and locate rows in groupsofpeople if 1 of the names in the group match. The extra caveat to this is that I also want to filter each row to only return the number of people in the rowcnt column. So if groupsofpeople dataframe has 5 rows with Chris in it, this statement before the filter is applied, then after the substring is found, only the first 2 rows are returned (as the groupsofpeople area in a specific order).
I dont have a full grasp on list comprehension to pull this off on my own, but I can tell you that this hardcoded solution would give the result of what I am striving towards if there is a way to loop through the names (in the str.contains()) as well as the rowcnt (in the head()):
founddf = groupsofpeople[groupsofpeople['group'].str.contains('Chris')].head(2)
upload = pd.concat([upload,founddf],ignore_index=False, sort=False)
founddf = groupsofpeople[groupsofpeople['group'].str.contains('Chase')].head(3)
upload = pd.concat([upload,founddf],ignore_index=False, sort=False)
founddf = groupsofpeople[groupsofpeople['group'].str.contains('Ross Rick')].head(1)
upload = pd.concat([upload,founddf],ignore_index=False, sort=False)
This is the result I would like to get to without hardcoding the loop/iterations):
group rank
Chase, Chris, William S. 2
Genie, Chris, Mike, Dee 1
Chase, Chris, William S. 2
Can anyone help me achieve this solution?
Here is a way to output the first n rows per person according to the 'rowcnt' value in the people
dataframe
names = set(people.name)
people2 = people.set_index('name')
(groupsofpeople
.assign(group_list=groupsofpeople['group'].str.split(', ').apply(lambda x: set(x).intersection(names)))
.explode('group_list')
.dropna(subset=['group_list'])
.groupby(['group_list'])
.apply(lambda d: d.head(people2.loc[d.name, 'rowcnt']))
.reset_index(drop=True)
#.drop('group_list', axis=1) #kept here to see groups
)
output:
group rank group_list
0 Chase, Chris, William S. 2 Chase
1 Genie, Chris, Mike, Dee 1 Chris
2 Chase, Chris, William S. 2 Chris
3 Chase, Chris, William S. 2 William S.
4 William S., Chris, Tim 4 William S.