I am trying to append a column to a dataframe which will sum a count from another dataframe based on the contents of a string cell. The dataframes I have look like this:
listy = pd.DataFrame(
['auto', 'ford', 'chevy', 'auto worker', 'ford motors', 'chevy trucks'])
listy['count'] = np.random.randint(1,500,size=6)
listy
0 | count | |
---|---|---|
0 | auto | 24 |
1 | ford | 45 |
2 | chevy | 384 |
3 | auto worker | 81 |
4 | ford motors | 156 |
5 | chevy trucks | 66 |
and
lister = pd.DataFrame(['auto','ford','chevy'])
0 | |
---|---|
0 | auto |
1 | ford |
2 | Chevy |
I'm trying to add a column to lister dataframe that sums listy['count'] of any row where ['0'] contains the value of listy[0]
I have gotten the following code to work for individual words
listy[listy[0].str.contains("auto")]['count'].sum()
Which returns the value I am looking for. My actual data set is thousands of rows so manually collecting totals just won't do.
I have tried the following code:
lister['total'] = listy[listy[0].str.contains(lister[0])]
and get the error
TypeError: 'Series' objects are mutable, thus they cannot be hashed
Which I have tried googling but to no avail. Is this even possible?
I have also tried using split on the multi-word column but have still been unable to come up with totals.
Any help is greatly appreciated.
One way using pandas.Series.str.extract
:
pat = f"({'|'.join(lister[0])})"
df = listy.groupby(listy[0].str.extract(pat, expand=False))["count"].sum()
print(df.reset_index())
Output:
0 count
0 auto 697
1 chevy 806
2 ford 302