
Extracting only nouns from list of lists pos_tag sequence?


I am trying to extract only the nouns from a list of lists of tokens tagged with nltk.pos_tag(). I can extract all the nouns, but only into a single flat list, which loses the list-of-lists structure. How can I extract the nouns while preserving that structure? Any help is highly appreciated.

Here, "list of lists" means a collection of tagged tokens in which each inner list holds the tokens of one sentence:

[[('icosmos', 'JJ'), ('cosmology', 'NN'), ('calculator', 'NN'), ('with', 'IN'), ('graph', 'JJ')], [('generation', 'NN'), ('the', 'DT'), ('expanding', 'VBG'), ('universe', 'JJ')], [('american', 'JJ'), ('institute', 'NN')]]

The output should look like:

[['cosmology', 'calculator'], ['generation'], ['institute']]

What I have tried is as follows:

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

def function1():
    # tokenized_raw_data is the raw text string, defined elsewhere
    tokens_sentences = sent_tokenize(tokenized_raw_data.lower())
    unfiltered_tokens = [word_tokenize(sentence) for sentence in tokens_sentences]
    # Keep only alphabetic tokens, one inner list per sentence
    word_list = []
    for sentence in unfiltered_tokens:
        word_list.append([word for word in sentence if word.isalpha()])
    # Tag each sentence separately
    tagged_tokens = []
    for tokens in word_list:
        tagged_tokens.append(nltk.pos_tag(tokens))
    # Attempt to keep only the nouns (this is the step that does not work as intended)
    nouns_tagged = [(word, tag) for word, tag in tagged_tokens
            if tag.startswith('NN')]
    print(nouns_tagged)

If I add the code snippet below to the original code, after the tagged_tokens list has been built, the output is a single flat list, which is not what I want.

only_tagged_nouns = []
for sentence in tagged_tokens:
    for word, pos in sentence:
        if (pos == 'NN' or pos == 'NNPS'):
            only_tagged_nouns.append(word)

Solution

  • You can build one inner list per sentence, so the nesting is preserved:

    words = [[('icosmos', 'JJ'), ('cosmology', 'NN'), ('calculator', 'NN'), ('with', 'IN'), ('graph', 'JJ')], [('generation', 'NN'), ('the', 'DT'), ('expanding', 'VBG'), ('universe', 'JJ')], [('american', 'JJ'), ('institute', 'NN')]]
    
    new_list = []
    for i in words:
        temp = [j[0] for j in i if j[1].startswith("NN")]
        new_list.append(temp)
    
    print(new_list)
    

    Output

    [['cosmology', 'calculator'], ['generation'], ['institute']]
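
The same loop can also be written as a nested list comprehension. The sketch below is one possible variant, not the only way to do it; the helper name `extract_nouns` and its `prefix` parameter are made up for illustration. It relies on the Penn Treebank noun tags (`NN`, `NNS`, `NNP`, `NNPS`) all starting with `NN`, and it assumes the input is already tagged, so it does not call NLTK itself.

```python
def extract_nouns(tagged_sentences, prefix="NN"):
    """Return the nouns of each sentence, keeping one inner list per sentence.

    `extract_nouns` and `prefix` are hypothetical names used for this sketch.
    """
    return [
        # Inner comprehension: filter the (word, tag) pairs of one sentence
        [word for word, tag in sentence if tag.startswith(prefix)]
        # Outer comprehension: one result list per input sentence
        for sentence in tagged_sentences
    ]


tagged = [
    [('icosmos', 'JJ'), ('cosmology', 'NN'), ('calculator', 'NN'), ('with', 'IN'), ('graph', 'JJ')],
    [('generation', 'NN'), ('the', 'DT'), ('expanding', 'VBG'), ('universe', 'JJ')],
    [('american', 'JJ'), ('institute', 'NN')],
]

print(extract_nouns(tagged))
# [['cosmology', 'calculator'], ['generation'], ['institute']]
```

A sentence with no nouns yields an empty inner list rather than being dropped, so the positions in the result still line up with the input sentences.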