python string find-occurrences multiple-occurrence

Count the occurrences of a wordlist within a string observation


I have a list of the 10 most frequent words in the abstracts of academic articles. I want to count how many times those words occur in each observation of my dataset.

The top 10 words are:

top10 = ['model','language','models','task', 'data', 'paper', 'results', 'information', 'text','performance']

The first 3 observations are:

column[0:3] = ['The models are showing a great performance.',
'The information and therefor the data in the text are good enough to fulfill the task.',
'Data in this way results in the best information and thus performance.']

The code should return a list with, for each observation, the total number of occurrences of the top 10 words. I tried the following code, but it raised an error: count() takes at most 3 arguments (10 given).

My code:

count = 0
for sentence in column:
    for word in sentence.split():
        count += word.lower().count('model','language','models','task', 'data', 'paper', 'results', 'information', 'text','performance')

I also want to lowercase all words and remove the punctuation. So the output should look like this:

output = (2, 4, 4)

The first observation counts 2 words of the top10 list, namely models and performance

The second observation counts 4 words of the top10 list, namely information, data, text and task

The third observation counts 4 words of the top10 list, namely data, results, information and performance

Hopefully you can help me out!


Solution

  • You can use a regex to split each sentence into words, then check whether each lowercased word is in top10.

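    import re

    count = []
    for sentence in column:
        c = 0
        # \w+ matches runs of word characters, so punctuation is dropped
        for word in re.findall(r'\w+', sentence):
            # a bool converts to 1 when the lowercased word is in top10
            c += int(word.lower() in top10)
        count.append(c)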
    

    count = [2, 4, 4]
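
  • The same idea can be wrapped in a small function. This is only a sketch: the helper name count_top_words is made up for illustration, and column is assumed to be a plain Python list holding the three example sentences from the question. It lowercases the sentence once and uses a set for the membership test.

```python
import re

top10 = ['model', 'language', 'models', 'task', 'data',
         'paper', 'results', 'information', 'text', 'performance']

# Assumption: the first three observations from the question, as a plain list.
column = [
    'The models are showing a great performance.',
    'The information and therefor the data in the text are good enough to fulfill the task.',
    'Data in this way results in the best information and thus performance.',
]


def count_top_words(sentence, words):
    """Count the tokens of `sentence` that appear in `words`, ignoring case."""
    wanted = set(words)  # set lookup avoids scanning the list for every token
    tokens = re.findall(r'\w+', sentence.lower())  # \w+ drops punctuation
    return sum(token in wanted for token in tokens)


counts = [count_top_words(sentence, top10) for sentence in column]
print(counts)  # [2, 4, 4]
```

    As for the original error: str.count takes a single substring plus optional start and end positions, so passing ten words raises the TypeError. It also counts substrings rather than whole words ('models'.count('model') is 1), which is why comparing whole tokens is the safer route here.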