Tags: python, nlp, spacy, dependency-parsing

Is there a way to retrieve the whole noun chunk using a root token in spaCy?


I'm very new to spaCy. I have been reading the documentation for hours and I'm still not sure whether what I'm asking is possible. Anyway...

As the title says, is there a way to get a given noun chunk using a token it contains? For example, given the sentence:

"Autonomous cars shift insurance liability toward manufacturers"

Would it be possible to get the "Autonomous cars" noun chunk when all I have is the "cars" token? Here is an example snippet of the scenario I'm going for.

import spacy

nlp = spacy.load("en_core_web_md")  # any English model with a parser

startingSentence = "Autonomous cars and magic wands shift insurance liability toward manufacturers"
doc = nlp(startingSentence)
noun_chunks = doc.noun_chunks

for token in doc:
    if token.dep_ == "dobj":
        print(token) # this will print "liability"

        # Is it possible to do anything from here to actually get the "insurance liability" noun chunk?

Any help will be greatly appreciated. Thanks!


Solution

  • You can easily find the noun chunk that contains the token you've identified by checking if the token is in one of the noun chunk spans:

    import spacy

    nlp = spacy.load("en_core_web_md")  # model choice matters; see the note below
    doc = nlp("Autonomous cars and magic wands shift insurance liability toward manufacturers")
    interesting_token = doc[7] # "liability", or however you identify the token you want
    for noun_chunk in doc.noun_chunks:
        if interesting_token in noun_chunk:
            print(noun_chunk)
    

    The output is not correct with en_core_web_sm and spaCy 2.0.18 because "shift" isn't identified as a verb, so you get:

    magic wands shift insurance liability

    With en_core_web_md, it's correct:

    insurance liability

    (It makes sense for the documentation (https://spacy.io/usage/linguistic-features#noun-chunks) to use examples with real ambiguities, since that's a realistic scenario, but it's confusing for new users when the examples are ambiguous enough that the analysis is unstable across versions and models.)
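
    If you want to go straight from a root token to its chunk (as in the question title), you can also build a lookup keyed on each chunk's root. Below is a minimal sketch using the standard Span.root and Token.i attributes; it assumes en_core_web_md is installed, but any English model with a parser should behave similarly:

    import spacy

    nlp = spacy.load("en_core_web_md")  # assumed model; any English model with a parser works
    doc = nlp("Autonomous cars and magic wands shift insurance liability toward manufacturers")

    # Each noun chunk is a Span whose .root is the token that heads the chunk,
    # so we can index the chunks by the root token's position in the doc.
    chunk_by_root = {chunk.root.i: chunk for chunk in doc.noun_chunks}

    for token in doc:
        if token.dep_ == "dobj":                 # "liability" with a correct parse
            chunk = chunk_by_root.get(token.i)
            if chunk is not None:
                print(chunk.text)                # "insurance liability"

    Keying on the token index (chunk.root.i) sidesteps any questions about how Token objects hash, and the same lookup also answers the original "cars" example, since "cars" is the root of the "Autonomous cars" chunk.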