Tags: python, pandas, nlp, nltk

How to return a list from a pos tag column?


These are the modules I'm importing:

import pandas as pd
from nltk.tag import pos_tag
import re
from collections import defaultdict,Counter
from nltk.stem import WordNetLemmatizer
from datetime import datetime
from tqdm import tqdm
import numpy as np
import os
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize,sent_tokenize

I have a df that is similar to this:

df = pd.DataFrame({'comments': ['Daniel is really cool',
                                'Daniel is the most',
                                'We had such a',
                                'Very professional operation',
                                'Lots of bookcases']})

Then I pass through the following:

df['tokenized'] = df['comments'].apply(word_tokenize)
df['tagged'] = df['tokenized'].apply(pos_tag)
df['lower_tagged'] = df['tokenized'].apply(lambda lt: [word.lower() for word in lt]).apply(pos_tag)

The column I am interested in is the lower_tagged column:

0    [(daniel, NN), (is, VBZ), (really, RB), (cool,...
1    [(daniel, NN), (is, VBZ), (the, DT), (most, RBS)]
2         [(we, PRP), (had, VBD), (such, JJ), (a, DT)]
3    [(very, RB), (professional, JJ), (operation, NN)]
4            [(lots, NNS), (of, IN), (bookcases, NNS)]

I am trying to implement a function which returns a list of the 1,000 most frequent nouns in the lower_tagged column.

The expected outcome should look something like:

nouns = ['daniel', 'operation', 'bookcases', 'lots']

One method I have tried is as follows:

lower_tag = df['lower_tagged']
print([t[0] for t in lower_tag if t[1] == 'NN'])

However, this just returns an empty list. Another method I've tried:

def list_nouns(df):
    s = lower_tag
    nouns = [word for word, pos in pos_tag(word_tokenize(s)) if pos.startswith('NN')]
    return nouns

However, I get this error: expected string or bytes-like object

Apologies for the long post - any suggestions would be much appreciated, as I have been stuck on this for a while! Thanks


Solution

  • Create a new DataFrame with explode and tolist, then use loc with a boolean mask built from str.startswith to keep only the noun rows, and finally value_counts and nlargest to take the most frequent words:

    top_n_words = 2
    # Explode the list column so each (word, tag) tuple gets its own row,
    # then split the tuples into two columns
    new_df = pd.DataFrame(
        df['lower_tagged'].explode().tolist(),
        columns=['word', 'part_of_speech']
    )
    # Keep noun tags (NN, NNS, NNP, ...), count each word, take the top N
    nouns = (
        new_df.loc[new_df['part_of_speech'].str.startswith('NN'), 'word']
        .value_counts()
        .nlargest(top_n_words)
        .index.tolist()
    )
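    To make the intermediate step concrete, here is roughly what the explode/tolist combination produces on a small hypothetical sample shaped like df['lower_tagged'] (only the shape matters here):

```python
import pandas as pd

# Hypothetical sample in the same shape as df['lower_tagged'] above
df = pd.DataFrame({'lower_tagged': [
    [('daniel', 'NN'), ('is', 'VBZ')],
    [('lots', 'NNS'), ('of', 'IN'), ('bookcases', 'NNS')],
]})

# explode gives each (word, tag) tuple its own row; tolist plus the
# DataFrame constructor then splits each tuple into two columns
new_df = pd.DataFrame(
    df['lower_tagged'].explode().tolist(),
    columns=['word', 'part_of_speech']
)
# new_df now has 5 rows and the columns 'word' and 'part_of_speech'
```

    From here, filtering on part_of_speech and counting word is straightforward.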
    

    Or explode and use the str accessor with str.startswith to build a boolean mask on the series of tuples, then value_counts and nlargest to take the most frequent words:

    top_n_words = 2
    s = df['lower_tagged'].explode()       # one (word, tag) tuple per row
    nouns = (
        s[s.str[-1].str.startswith('NN')]  # filter on the tag via str
            .str[0]                        # keep just the word
            .value_counts()
            .nlargest(top_n_words)
            .index.tolist()
    )
    

    Just change top_n_words to select however many words are needed (1000 for the question's requirement).

    nouns for top_n_words = 2:

    ['daniel', 'bookcases']

    (daniel is the only noun that occurs twice; which single-occurrence noun fills the second slot depends on how value_counts breaks ties.)
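    A third option, sketched here in plain Python since the question already imports Counter: flatten the Series of tagged lists in a generator expression and let Counter.most_common do the ranking. The sample frame below mirrors df['lower_tagged'] from the question:

```python
import pandas as pd
from collections import Counter

# Sample data in the same shape as df['lower_tagged']
df = pd.DataFrame({'lower_tagged': [
    [('daniel', 'NN'), ('is', 'VBZ'), ('really', 'RB'), ('cool', 'JJ')],
    [('daniel', 'NN'), ('is', 'VBZ'), ('the', 'DT'), ('most', 'RBS')],
    [('we', 'PRP'), ('had', 'VBD'), ('such', 'JJ'), ('a', 'DT')],
    [('very', 'RB'), ('professional', 'JJ'), ('operation', 'NN')],
    [('lots', 'NNS'), ('of', 'IN'), ('bookcases', 'NNS')],
]})

# Flatten the Series of (word, tag) lists and count noun occurrences;
# each row is a list of tuples, so two nested for-clauses are needed
noun_counts = Counter(
    word
    for tagged in df['lower_tagged']
    for word, pos in tagged
    if pos.startswith('NN')
)

# most_common(n) returns the n highest-count (word, count) pairs
nouns = [word for word, _ in noun_counts.most_common(1000)]
```

    This also shows why the attempts in the question fail: iterating over df['lower_tagged'] yields whole lists (one per row), not individual (word, tag) tuples, so t[1] == 'NN' never matches; and word_tokenize expects a string, not a Series, hence the "expected string or bytes-like object" error.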