python pandas nltk pos-tagger

How can I create a pandas dataframe column for each part-of-speech tag?


I have a dataset that consists of tokenized, POS-tagged phrases as one column of a dataframe:

Current Dataframe

I want to create a new column in the dataframe, consisting only of the proper nouns in the previous column:

Desired Solution

Right now, I'm trying something like this for a single row:

if 'NNP' in df['Description_POS'][96][0:-1]:
    df['Proper Noun'] = df['Description_POS'][96]

But I don't know how to repeat this for every row, or how to extract the tuple that contains the proper noun. I'm very new to this and not sure what to use, so any help would be really appreciated!

Edit: I tried the recommended solution, and it seems to work, but there is an issue.

This was my dataframe: Original dataframe

After implementing the recommended code

df['Proper Nouns'] = df['POS_Description'].apply(
    lambda row: [i[0] for i in row if i[1] == 'NNP']) 

it looks like this: Dataframe after creating a proper nouns column


Solution

  • You can use the apply method. Called on a Series (a single dataframe column), it applies the given function to each element and returns a new Series, which you can add to your dataframe as a new column:

    df['Proper Nouns'] = df['POS_Description'].apply(
        lambda row: [i[0] for i in row if i[1] == 'NNP'])
    

    This assumes that each cell of POS_Description holds a list of (token, tag) tuples.
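For reference, here is a minimal, self-contained sketch of the same idea. The sample rows and the helper name `proper_nouns` are made up for illustration and are not taken from the question. It also covers two assumptions that may or may not match your data: that the column might have been loaded from a CSV file (in which case each cell is a string such as "[('John', 'NNP')]" rather than a real list), and that you may also want plural proper nouns, which the Penn Treebank tagset marks as NNPS.

```python
import ast

import pandas as pd

# Hypothetical sample data: each cell is a list of (token, tag) tuples.
df = pd.DataFrame({
    "POS_Description": [
        [("John", "NNP"), ("lives", "VBZ"), ("in", "IN"), ("Paris", "NNP")],
        [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
    ]
})


def proper_nouns(tagged):
    """Return the tokens tagged NNP or NNPS from a list of (token, tag) pairs."""
    # Assumption: a cell that is a string is the text form of a list,
    # e.g. after a CSV round trip, so parse it back into Python objects.
    if isinstance(tagged, str):
        tagged = ast.literal_eval(tagged)
    return [token for token, tag in tagged if tag in ("NNP", "NNPS")]


df["Proper Nouns"] = df["POS_Description"].apply(proper_nouns)
print(df["Proper Nouns"].tolist())  # [['John', 'Paris'], []]
```

Rows without any proper noun get an empty list. If you only want singular proper nouns, as in the lambda above, compare against 'NNP' alone.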