Search code examples
python-3.xstringpandasfrequency

Pandas: How many times does a string appear in a dataframe cell?


I believe to have a simple problem. I have a pandas dataframe df looking quite similar to this:

data = [{"Text" : "Dog", "Dog" : 1},
        {"Text" : "Cat", "Dog" : 0}, 
        {"Text" : "Mouse", "Dog" : 0}, 
        {"Text" : "Dog", "Dog" : 1}]

df = pd.DataFrame(data)

I am trying to search the column Text for a number of keywords and count how many times they appear in each cell. The result is supposed to be stored in a new column that shows how many times the specific keyword was found. The result is supposed to be just like the Dog column.

I tried using pandas str.count. It works just fine. But in the moment I try to store the result in a new column, I run in to trouble:

mykewords = ('Cat', 'Mouse')
df['Cat'] = df.Text.str.count("Cat")

I get the following error message:

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':

I have two questions:

  1. What am I doing wrong and how can I solve it?
  2. How can loop through all keywords in mykeywords and get a column each?

Thank you very much for any help in advance!


Solution

  • If possible multiple values in text and need count values:

    mykewords = ('Cat', 'Mouse')
    for x in mykewords:
        df[x] = df.Text.str.count(x)
    

    Better solution is use words boundaries with Series.str.findall and Series.str.len:

    for x in mykewords:
        df[x] = df.Text.str.findall(r"\b{}\b".format(x)).str.len()
    

    Difference in solutions:

    data = [{"Text" : "Dog Cat Catman", "Dog" : 1},
            {"Text" : "Cat Cat", "Dog" : 0}, 
            {"Text" : "Mouse Cat", "Dog" : 0}, 
            {"Text" : "Dog", "Dog" : 1}]
    
    df = pd.DataFrame(data)
    df1 = df.copy()
    print (df)
       Dog            Text
    0    1  Dog Cat Catman
    1    0         Cat Cat
    2    0       Mouse Cat
    3    1             Dog
    
    mykewords = ('Cat', 'Mouse')
    
    for x in mykewords:
        df[x] = df.Text.str.findall(r"\b{}\b".format(x)).str.len()
    print (df)
       Dog            Text  Cat  Mouse
    0    1  Dog Cat Catman    1      0 <-not match Catman
    1    0         Cat Cat    2      0
    2    0       Mouse Cat    1      1
    3    1             Dog    0      0
    
    for x in mykewords:
        df1[x] = df1.Text.str.count(x)
    print (df1)
       Dog            Text  Cat  Mouse
    0    1  Dog Cat Catman    2      0 <-match Catman
    1    0         Cat Cat    2      0
    2    0       Mouse Cat    1      1
    3    1             Dog    0      0