Tags: python, python-3.x, nlp, spacy

How to get all noun phrases in spaCy


I am new to spaCy and I would like to extract all the noun phrases from a sentence. I'm wondering how to do it. I have the following code:

import spacy

# the "en" shortcut was removed in spaCy 3; load the small English model instead
nlp = spacy.load("en_core_web_sm")

with open("E:/test.txt", "r") as file:
    doc = nlp(file.read())

for np in doc.noun_chunks:
    print(np.text)

But it returns only the base noun phrases, i.e. phrases that don't have any other noun phrase nested inside them. For example, for the following sentence I get the result below:

Phrase: We try to explicitly describe the geometry of the edges of the images.

Result: We, the geometry, the edges, the images.

Expected result: We, the geometry, the edges, the images, the geometry of the edges of the images, the edges of the images.

How can I get all the noun phrases, including nested phrases?


Solution

  • See the commented code below, which recursively combines the nouns. The code is inspired by the spaCy docs.

    import spacy

    # the "en" shortcut was removed in spaCy 3; load the small English model instead
    nlp = spacy.load("en_core_web_sm")

    doc = nlp("We try to explicitly describe the geometry of the edges of the images.")

    for np in doc.noun_chunks:
        print(np)  # printing a Span prints its text, same as np.text

    print()

    # code to recursively combine nouns
    # 'We' is actually a pronoun, but it is included in your expected output,
    # hence the token.pos_ == "PRON" part in the last if statement;
    # you may prefer to extract PRON tokens separately, like the noun chunks above

    # collect the indices of all noun tokens
    nounIndices = [idx for idx, token in enumerate(doc) if token.pos_ == "NOUN"]
    print(nounIndices)

    for idxValue in nounIndices:
        # re-parse so every merge starts from an unmodified doc
        doc = nlp("We try to explicitly describe the geometry of the edges of the images.")
        # the span from a noun's left edge to its right edge covers its full subtree
        span = doc[doc[idxValue].left_edge.i : doc[idxValue].right_edge.i + 1]
        # Span.merge() was removed in spaCy 3; use the retokenizer instead
        with doc.retokenize() as retokenizer:
            retokenizer.merge(span)

        for token in doc:
            if token.dep_ in ("dobj", "pobj") or token.pos_ == "PRON":
                print(token.text)
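  • A lighter-weight sketch (not part of the original answer; it assumes spaCy 3.x and the en_core_web_sm model): a token's left_edge and right_edge already delimit its whole subtree, so you can slice those spans directly and skip the re-parsing and merging step entirely.

    import spacy

    nlp = spacy.load("en_core_web_sm")

    doc = nlp("We try to explicitly describe the geometry of the edges of the images.")

    # every noun or pronoun token, expanded to the full span it heads
    phrases = set()
    for token in doc:
        if token.pos_ in ("NOUN", "PRON"):
            phrases.add(doc[token.left_edge.i : token.right_edge.i + 1].text)

    # add the base noun chunks too, so nothing from noun_chunks is lost
    for np in doc.noun_chunks:
        phrases.add(np.text)

    print(phrases)

    For the example sentence this should print the six phrases from the expected result (as an unordered set): We, the geometry, the edges, the images, the edges of the images, and the geometry of the edges of the images.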