I have a list of the 10 most frequent words in the abstracts of academic articles. I want to count how many times those words occur in each observation of my dataset.
The top 10 words are:
top10 = ['model','language','models','task', 'data', 'paper', 'results', 'information', 'text','performance']
For example, the first 3 observations are:
column[0:3] = ['The models are showing a great performance.',
'The information and therefor the data in the text are good enough to fulfill the task.',
'Data in this way results in the best information and thus performance.']
The code should return, for each observation, the total number of occurrences of the top-10 words. I tried the following code, but it raised an error: count() takes at most 3 arguments (10 given).
My code:
count = 0
for sentence in column:
    for word in sentence.split():
        count += word.lower().count('model','language','models','task', 'data', 'paper', 'results', 'information', 'text','performance')
I also want to lowercase all words and remove the punctuation. So the output should look like this:
output = (2, 4, 4)
The first observation contains 2 words from the top10 list, namely models and performance.
The second observation contains 4 words from the top10 list, namely information, data, text and task.
The third observation contains 4 words from the top10 list, namely data, results, information and performance.
Hopefully you can help me out!
You can use a regex to split each sentence into words and then check whether each lowercased word is in top10. Matching on \w+ also drops the punctuation.
import re

count = []
for sentence in column:
    c = 0
    # \w+ matches runs of word characters, so punctuation is skipped
    for word in re.findall(r'\w+', sentence):
        c += int(word.lower() in top10)
    count.append(c)
print(count)  # [2, 4, 4]
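If you prefer something more compact, the same idea can be wrapped in a small helper. This is only a sketch: the function name count_top_words is made up for illustration, and it assumes column is a plain list of strings like the three observations in the question. Turning top10 into a set makes each membership check fast, which may matter on a larger dataset.

```python
import re

top10 = ['model', 'language', 'models', 'task', 'data',
         'paper', 'results', 'information', 'text', 'performance']

# The three example observations from the question
column = ['The models are showing a great performance.',
          'The information and therefor the data in the text are good enough to fulfill the task.',
          'Data in this way results in the best information and thus performance.']


def count_top_words(sentence, vocabulary):
    """Hypothetical helper: count the words of `sentence` found in `vocabulary`."""
    vocab = set(vocabulary)  # set lookup instead of scanning a list
    # Lowercase first, then keep only runs of word characters (drops punctuation)
    words = re.findall(r'\w+', sentence.lower())
    return sum(word in vocab for word in words)


counts = [count_top_words(sentence, top10) for sentence in column]
print(counts)  # [2, 4, 4]
```

Note that this counts whole words only, so 'models' is matched as 'models' and not also as 'model'. str.count, by contrast, looks for a single substring, which is why passing it ten words raised the error in the question.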