python-3.x nlp

How to count the occurrences of specific words in every sentence of a paragraph in a dataframe in Python


I am working with a large set of customer survey data in a dataframe. I need to count the number of sentences in which the two words "customer" and "consumer" appear together. The problem is that the data is running text, as shown below:

import pandas as pd

df = pd.read_excel("Raw Data.xlsx")
df.head()

ID   sentence
1    There is a issue with the way the website works. The customer is asked to register but the icon does 
     not work resulting in the a long wait period. Resolution: Consumer request has been forwarded the 
     concerned department
2    This package on your website gives us information to consumer buying pattern. The customer is buying 
     this item mutiple time. This reminds us of the diaper and beer example. But the customer knows what 
     to buy, hence the recommender system is not accurate
     Consumer is asking for a resolution and the same is being worked on by the customer department will 
     keep you posted
     The customer department contacted the consumer on further clarifications.

I have split the text above into sentences using sent_tokenize:

from nltk.tokenize import sent_tokenize

df = df.join(df.sentence.apply(sent_tokenize).rename('SENTENCES'))

Now I need to get the number of times the words "customer" and "consumer" appear together in each sentence.

Desired Output

ID   SENTENCES                                                                       Occurrence
1    There is a issue with the way the website works.                                  0
     The customer is asked to register but the icon does 
     not work resulting in the a long wait period. 
     Resolution: Consumer request has been forwarded the 
     concerned department
2    This package on your website gives us information to consumer buying pattern.     2
     The customer is buying this item mutiple time. 
     This reminds us of the diaper and beer example. 
     But the customer knows what to buy, hence the recommender system is not accurate.
     Consumer is asking for a resolution and the same is being worked on by the customer department will 
     keep you posted.
     The customer department contacted the consumer on further clarifications.

Solution

  • You could go about this using sets and lambda functions. The goal is: count how many sentences in each row have the token "consumer" and the token "customer".

    Let's prepare the set you're going to use:

    tokens = {'consumer', 'customer'}
    

    Next, let's use a lambda function to create the new column:

    df['count'] = df.SENTENCES.apply(lambda x: sum([len(tokens.intersection(sent.split())) > 1 for sent in x]))
    

    Breaking this down bit by bit so you can see what it does:

    # Check if both tokens appear in the sentence (returns True if they do)
    len(tokens.intersection(sent.split())) > 1
    
    # Do this for every sentence in the row
    [len(tokens.intersection(sent.split())) > 1 for sent in x]
    # This looks like [True, True, False, False, etc]
    
    # Add up all the True values
    sum([len(tokens.intersection(sent.split())) > 1 for sent in x])
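One thing to watch: `sent.split()` compares words case-sensitively and keeps punctuation attached, so "Consumer" at the start of a sentence or "customer." at the end would not match the lowercase tokens in the set. Below is a minimal sketch of a more forgiving variant, assuming you want whole-word, case-insensitive matches. The helper name `count_sentences_with_both` and the regex tokenization are illustrative choices, not part of the original answer, and the sample list is row 2 of the question already split into sentences.

```python
import re

TOKENS = {"consumer", "customer"}

def count_sentences_with_both(sentences, tokens=TOKENS):
    """Count sentences containing every word in `tokens` (hypothetical helper)."""
    count = 0
    for sent in sentences:
        # Lowercase first, then keep only alphabetic runs, so that
        # "Consumer" and "customer." both reduce to plain lowercase words.
        words = set(re.findall(r"[a-z]+", sent.lower()))
        # tokens <= words is True when every token is present in the sentence
        if tokens <= words:
            count += 1
    return count

# Row 2 of the question, split into sentences
row_2 = [
    "This package on your website gives us information to consumer buying pattern.",
    "The customer is buying this item mutiple time.",
    "This reminds us of the diaper and beer example.",
    "But the customer knows what to buy, hence the recommender system is not accurate.",
    "Consumer is asking for a resolution and the same is being worked on by the customer department will keep you posted.",
    "The customer department contacted the consumer on further clarifications.",
]

print(count_sentences_with_both(row_2))  # prints 2
```

On the dataframe this would be applied as `df['count'] = df.SENTENCES.apply(count_sentences_with_both)`. Note that it matches whole words only, so a plural such as "customers" would not count as "customer".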