Search code examples
pythonpandasdataframetext

Counting the occurrence of words in a dataframe column using a list of strings


I have a list of strings and a dataframe with a text column. In the text column, I have lines of text. I want to count how many times each word in the list of strings occurs in the text column. I am aiming to add two columns to the dataframe; one column with the word and the other column having the number of occurrences. If there is a better solution, I am open to it. It would be great to learn different ways to accomplish this. I would ideally like one dataframe in the end.

string_list = ['had', 'it', 'the']

Current dataframe:

enter image description here

Dataframe in code:

pd.DataFrame({'title': {0: 'book1', 1: 'book2', 2: 'book3', 3: 'book4', 4: 'book5'},
 'text': {0: 'His voice had never sounded so cold',
  1: 'When she arrived home, she noticed that the curtains were closed.',
  2: 'He was terrified of small spaces and she knew',
  3: "It was time. She'd fought against it for so long",
  4: 'As he took in the view from the twentieth floor, the lights went out all over the city'},
 'had': {0: 1, 1: 5, 2: 5, 3: 2, 4: 5},
 'it': {0: 1, 1: 3, 2: 2, 3: 1, 4: 2},
 'the': {0: 1, 1: 4, 2: 5, 3: 3, 4: 3}})

Attempting to get a dataframe like this:

enter image description here


Solution

  • Define a custom regex, extractall, join, and melt:

    regex = '|'.join(fr'(?P<{w}>\b{w}\b)' for w in string_list)
    
    (df[['title', 'text']]
     .join(df['text'].str.extractall(regex).notna().groupby(level=0).sum())
     .fillna(0)
     .melt(id_vars=['title', 'text'], var_name='word', value_name='word count')
     )
    

    Output:

        title                                               text word  word count
    0   book1                His voice had never sounded so cold  had         1.0
    1   book2  When she arrived home, she noticed that the cu...  had         0.0
    2   book3      He was terrified of small spaces and she knew  had         0.0
    3   book4   It was time. She'd fought against it for so long  had         0.0
    4   book5  As he took in the view from the twentieth floo...  had         0.0
    5   book1                His voice had never sounded so cold   it         0.0
    6   book2  When she arrived home, she noticed that the cu...   it         0.0
    7   book3      He was terrified of small spaces and she knew   it         0.0
    8   book4   It was time. She'd fought against it for so long   it         1.0
    9   book5  As he took in the view from the twentieth floo...   it         0.0
    10  book1                His voice had never sounded so cold  the         0.0
    11  book2  When she arrived home, she noticed that the cu...  the         1.0
    12  book3      He was terrified of small spaces and she knew  the         0.0
    13  book4   It was time. She'd fought against it for so long  the         0.0
    14  book5  As he took in the view from the twentieth floo...  the         4.0