Tags: python, pandas, pandas-groupby, counter

Count occurrences of a set of words in a DataFrame column of strings, both globally and per row


I hope I am not creating a duplicate, but I spent hours looking for something similar to my question :)

That said, I have the following input:

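import pandas as pd

# Each row holds a string of space-separated words
foo = {"Brand": ["loc doc poc",
                 "roc top mop",
                 "loc lot not",
                 "roc lot tot",
                 "loc bot sot",
                 "nap rat sat"]}

# Words whose occurrences should be counted
word_list = ["loc", "top", "lot"]
df = pd.DataFrame(foo)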

Two desired outputs:

1. A dictionary with the occurrences stored

2. A new column containing the number of occurrences for each row

#Outputs: 
counter_dic={"loc":3,"top":1,"lot":2}

            Brand   count
0   loc  doc  poc       1
1   roc  top  mop       1
2   loc  lot  not       2
3   roc  lot  tot       1
4   loc  bot  sot       1
5   nap  rat  sat       0

The only idea that I had:

  • Count how many times each term in a set occurs. Could I create a bag of words and then filter it based on the dictionary keys?
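
The bag-of-words idea above can be sketched with `collections.Counter`. This is a minimal sketch, not a definitive implementation: it assumes the words in each row are separated by whitespace and that only exact token matches should count. The names `bag` and `wanted` are introduced here for illustration only.

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({"Brand": ["loc doc poc", "roc top mop", "loc lot not",
                             "roc lot tot", "loc bot sot", "nap rat sat"]})
word_list = ["loc", "top", "lot"]

# Bag of words over the whole column: split each row on whitespace
# and count every token.
bag = Counter(token for row in df["Brand"] for token in row.split())

# Keep only the terms of interest (a Counter returns 0 for a missing key).
counter_dic = {word: bag[word] for word in word_list}

# Per-row count: how many tokens of the row belong to word_list.
wanted = set(word_list)
df["count"] = df["Brand"].apply(
    lambda row: sum(token in wanted for token in row.split())
)

print(counter_dic)
print(df)
```

Because the rows are split into tokens first, a word such as "lots" would not be counted as "lot" under this approach.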

If a similar question already exists, this one can of course be closed.

I checked the following ones

This is one of the most similar:

Check If a String Is In A Pandas DataFrame

Python Lists Finding The Number Of Times A String Occurs

Count Occurrences Of A Substring In A List Of Strings


Solution

  • Here is one potential solution using str.count to create an interim count DataFrame which will help with both outputs.

    df_counts = pd.concat([df['Brand'].str.count(x).rename(x) for x in word_list], axis=1)
    

    Looks like:

       loc  top  lot
    0    1    0    0
    1    0    1    0
    2    1    0    1
    3    0    0    1
    4    1    0    0
    5    0    0    0
    

    1 - Dictionary with the occurrences stored

    df_counts.sum().to_dict()
    

    [out]

    {'loc': 3, 'top': 1, 'lot': 2}
    

    2 - New column containing the number of occurrences for each row

    df['count'] = df_counts.sum(axis=1)
    

    [out]

             Brand  count
    0  loc doc poc      1
    1  roc top mop      1
    2  loc lot not      2
    3  roc lot tot      1
    4  loc bot sot      1
    5  nap rat sat      0
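
One caveat worth illustrating: `Series.str.count` treats its argument as a regular expression and counts every match, including matches inside longer words. If only whole words should count, one possible variant wraps each term in `\b` word boundaries and escapes it with `re.escape`. The sketch below is an assumption about what may be wanted, not part of the original answer; the `demo` rows and the helper `count_whole_words` are hypothetical and chosen only so that the two approaches differ.

```python
import re

import pandas as pd

word_list = ["loc", "top", "lot"]

# Hypothetical rows (not from the question) in which plain substring
# matching over-counts: "block" contains "loc", "stop" contains "top",
# and "lots" contains "lot".
demo = pd.DataFrame({"Brand": ["block loc", "stop lots", "lot top"]})


def count_whole_words(series, words):
    """Return one column per word, counting whole-word matches only."""
    patterns = {w: rf"\b{re.escape(w)}\b" for w in words}
    return pd.concat(
        [series.str.count(p).rename(w) for w, p in patterns.items()],
        axis=1,
    )


# Same construction as in the answer: counts substring matches.
substring_counts = pd.concat(
    [demo["Brand"].str.count(w).rename(w) for w in word_list], axis=1
)
whole_word_counts = count_whole_words(demo["Brand"], word_list)

print(substring_counts.sum().to_dict())
print(whole_word_counts.sum().to_dict())
```

On the question's data both versions give the same result, since every row there consists of separate three-letter words.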