Search code examples
pythonpandasdataframesum

Add a column to dataframe summing another dataframe based on substring


I am trying to append a column to a dataframe which will sum a count from another dataframe based on the contents of a string cell. The dataframes I have look like this:

listy = pd.DataFrame(
    ['auto', 'ford', 'chevy', 'auto worker', 'ford motors', 'chevy trucks'])

listy['count'] = np.random.randint(1,500,size=6)
listy
0 count
0 auto 24
1 ford 45
2 chevy 384
3 auto worker 81
4 ford motors 156
5 chevy trucks 66

and

lister = pd.DataFrame(['auto','ford','chevy'])
0
0 auto
1 ford
2 Chevy

I'm trying to add a column to lister dataframe that sums listy['count'] of any row where ['0'] contains the value of listy[0]

I have gotten the following code to work for individual words

listy[listy[0].str.contains("auto")]['count'].sum()

Which returns the value I am looking for. My actual data set is thousands of rows so manually collecting totals just won't do.

I have tried the following code:

lister['total'] = listy[listy[0].str.contains(lister[0])]

and get the error

TypeError: 'Series' objects are mutable, thus they cannot be hashed

Which I have tried googling but to no avail. Is this even possible?

I have also tried using split on the multi-word column but have still been unable to come up with totals.

Any help is greatly appreciated.


Solution

  • One way using pandas.Series.str.extract:

    pat = f"({'|'.join(lister[0])})"
    df = listy.groupby(listy[0].str.extract(pat, expand=False))["count"].sum()
    print(df.reset_index())
    

    Output:

           0  count
    0   auto    697
    1  chevy    806
    2   ford    302