python nlp tokenize spacy

Is there a way to get entire constituents using SpaCy?


I guess I'm trying to navigate SpaCy's parse tree in a more blunt way than is provided.

For instance, if I have sentences like "He was a genius" or "The dog was green", I want to be able to save the complements of the verb ("a genius" and "green") to variables.

token.children provides only the IMMEDIATE syntactic dependents, so, for the first example, the children of "was" are "He", "genius", and the final period, and "a" is in turn a child of "genius". This isn't so helpful if I just want the entire constituent "a genius". I'm not sure how to reconstruct it from token.children or if there's a better way.

I can figure out how to match "is" and "was" using token.text (part of what I'm trying to do), but I can't figure out how to return the whole constituent "a genius" using the info provided about children.

import spacy
nlp = spacy.load('en_core_web_sm')

sent = nlp("He was a genius.")

for token in sent:
    print(token.text, token.tag_, token.dep_, list(token.children))

This is the output:

He PRP nsubj []
was VBD ROOT [He, genius, .]
a DT det []
genius NN attr [a]
. . punct []


Solution

  • You can use Token.subtree (see the docs) to get a given token together with all of its descendants in the dependency tree, not just its immediate children.

    For example, to get the full noun or adjective phrases that complement a form of "be":

    import spacy
    
    # same model as in the question; the 'en' shortcut no longer works in spaCy v3
    nlp = spacy.load('en_core_web_sm')
    
    text = "He was a genius of the best kind and his dog was green."
    
    for token in nlp(text):
        if token.pos_ in ['NOUN', 'ADJ']:
            if token.dep_ in ['attr', 'acomp'] and token.head.lemma_ == 'be':
                # to test for only verb forms 'is' and 'was' use token.head.lower_ in ['is', 'was']
                print([t.text for t in token.subtree])
    

    Outputs:

    ['a', 'genius', 'of', 'the', 'best', 'kind']
    ['green']
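
    To make the link back to token.children concrete, here is a minimal pure-Python sketch of the idea behind a subtree: start at a node, follow children recursively, and put the collected nodes back in sentence order. The Node class and the subtree function are hypothetical stand-ins written for illustration (they are not spaCy's classes or its implementation), and the tree is built by hand to match the parse printed in the question.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for a parsed token: text, position, direct children."""
    text: str
    i: int
    children: List["Node"] = field(default_factory=list)


def subtree(node: Node) -> List[Node]:
    """Return the node plus all of its descendants, sorted by sentence position."""
    nodes = [node]
    for child in node.children:
        nodes.extend(subtree(child))
    return sorted(nodes, key=lambda n: n.i)


# Hand-built tree for "He was a genius." following the output in the question:
# "was" is the root with children He, genius and "."; "a" is a child of "genius".
he = Node("He", 0)
a = Node("a", 2)
genius = Node("genius", 3, [a])
period = Node(".", 4)
was = Node("was", 1, [he, genius, period])

constituent = " ".join(n.text for n in subtree(genius))
print(constituent)  # a genius
```

    With a real spaCy token the recursion is not needed, since token.subtree already walks the descendants; the sketch only shows why starting from "genius" rather than from "was" yields the whole phrase "a genius".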