These are the modules I'm importing:
import pandas as pd
from nltk.tag import pos_tag
import re
from collections import defaultdict,Counter
from nltk.stem import WordNetLemmatizer
from datetime import datetime
from tqdm import tqdm
import numpy as np
import os
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize,sent_tokenize
I have a df that is similar to this:
df = pd.DataFrame({'comments': ['Daniel is really cool',
'Daniel is the most',
'We had such a',
'Very professional operation',
'Lots of bookcases']})
Then I pass through the following:
df['tokenized'] = df['comments'].apply(word_tokenize)
df['tagged'] = df['tokenized'].apply(pos_tag)
df['lower_tagged'] = df['tokenized'].apply(lambda lt: [word.lower() for word in lt]).apply(pos_tag)
The column I am interested in is the lower_tagged column:
0 [(daniel, NN), (is, VBZ), (really, RB), (cool,...
1 [(daniel, NN), (is, VBZ), (the, DT), (most, RBS)]
2 [(we, PRP), (had, VBD), (such, JJ), (a, DT)]
3 [(very, RB), (professional, JJ), (operation, NN)]
4 [(lots, NNS), (of, IN), (bookcases, NNS)]
I am trying to implement a function which returns a list of the 1,000 most frequent nouns in the lower_tagged column.
The expected outcome should look something like:
nouns = ['daniel', 'operation', 'bookcases', 'lots']
One method I have tried is as follows:
lower_tag = df['lower_tagged']
print([t[0] for t in lower_tag if t[1] == 'NN'])
However, this just returns an empty list. Another method I've tried:
def list_nouns(df):
    s = lower_tag
    nouns = [word for word, pos in pos_tag(word_tokenize(s)) if pos.startswith('NN')]
    return nouns
However, I get this error: "expected string or bytes-like object".
Apologies for the long post. Any suggestions would be much appreciated, as I have been stuck on this for a while! Thanks
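For context on why the first attempt prints an empty list: iterating df['lower_tagged'] yields one list of tuples per row, so each t is a whole row and t[1] is a (word, tag) tuple, never the string 'NN'. A minimal sketch with one made-up row:

```python
# One row of the lower_tagged column: a list of (word, tag) tuples
row = [('daniel', 'NN'), ('is', 'VBZ'), ('really', 'RB'), ('cool', 'JJ')]

# Iterating the column yields whole rows, so t is a list and
# t[1] is the tuple ('is', 'VBZ') -- never equal to the string 'NN'
result = [t[0] for t in [row] if t[1] == 'NN']
assert result == []

# The second attempt fails for a different reason: word_tokenize expects
# a single string, not a pandas Series, hence
# "expected string or bytes-like object"
```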
Create a new DataFrame with explode and tolist, then use loc with a boolean index created with str.startswith, and take the top of the value counts with nlargest:
top_n_words = 2
new_df = pd.DataFrame(
    df['lower_tagged'].explode().tolist(),
    columns=['word', 'part_of_speech']
)
nouns = (
    new_df.loc[new_df['part_of_speech'].str.startswith('NN'), 'word']
    .value_counts()
    .nlargest(top_n_words)
    .index.tolist()
)
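As a side note on what explode is doing here (a minimal sketch with made-up two-row data): each list element becomes its own row, so tolist() hands the DataFrame constructor a flat list of (word, tag) tuples that it splits into two columns.

```python
import pandas as pd

df = pd.DataFrame({'lower_tagged': [
    [('daniel', 'NN'), ('is', 'VBZ')],
    [('lots', 'NNS'), ('of', 'IN')],
]})

# explode() flattens the list column into one tuple per row
exploded = df['lower_tagged'].explode()

# tolist() then yields [('daniel', 'NN'), ('is', 'VBZ'), ...],
# which the DataFrame constructor splits into two columns
new_df = pd.DataFrame(exploded.tolist(), columns=['word', 'part_of_speech'])
print(new_df['word'].tolist())  # ['daniel', 'is', 'lots', 'of']
```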
Or explode, then use the str accessor and str.startswith to build a boolean index on the Series, and again take the top of the value counts with nlargest:
top_n_words = 2
s = df['lower_tagged'].explode()
nouns = (
    s[s.str[-1].str.startswith('NN')].str[0]
    .value_counts()
    .nlargest(top_n_words)
    .index.tolist()
)
Just change top_n_words to select however many words are needed. Note that daniel is the only noun appearing twice, so it always comes first; which of the count-1 nouns fills the remaining slot depends on tie-breaking. nouns for top_n_words = 2:
['daniel', 'bookcases']
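Since collections.Counter is already imported in the question, a plain-Python alternative (a sketch using the same example data) sidesteps pandas version differences in how value_counts names its columns after reset_index:

```python
from collections import Counter

import pandas as pd

# Same example data as the question's lower_tagged column
df = pd.DataFrame({'lower_tagged': [
    [('daniel', 'NN'), ('is', 'VBZ'), ('really', 'RB'), ('cool', 'JJ')],
    [('daniel', 'NN'), ('is', 'VBZ'), ('the', 'DT'), ('most', 'RBS')],
    [('we', 'PRP'), ('had', 'VBD'), ('such', 'JJ'), ('a', 'DT')],
    [('very', 'RB'), ('professional', 'JJ'), ('operation', 'NN')],
    [('lots', 'NNS'), ('of', 'IN'), ('bookcases', 'NNS')],
]})

# Count every word whose tag starts with 'NN' across all rows
counts = Counter(
    word
    for row in df['lower_tagged']
    for word, pos in row
    if pos.startswith('NN')
)

# most_common(n) returns up to n (word, count) pairs, most frequent first,
# so this gives the 1,000 most frequent nouns the question asks for
nouns = [word for word, _ in counts.most_common(1000)]
```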