
spaCy noun phrases: how to locate the start and end token of every noun chunk in a doc


I am using spaCy to get the noun phrases of a text. What I want to do is locate those noun phrases in the text by the token index of their words.

For instance

import spacy

# Load the English model
nlp = spacy.load("en_core_web_sm")
doc = nlp("The blue car is nicer than the white car")
noun_chunks = list(doc.noun_chunks)

for i,noun_chunk in enumerate(noun_chunks):
    for j,token in enumerate(noun_chunk):
        print(i,noun_chunk,j,token.text)

The value j is the index of the token within the span of the noun chunk, but I want to get the token.i of the first and last word of each noun chunk.

In the example, the two noun chunks are "The blue car" and "the white car".

the desired output would be:

tokens: The 1 blue 2 car 3 is 4 nicer 5 than 6 the 7 white 8 car 9

noun chunk 1: "the blue car"; starts 1, ends 3

noun chunk 2: "the white car"; starts 7, ends 9

With the start and end of a noun chunk I will be able to identify the span of the noun chunk in the doc.

Thanks


Solution

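  • I did not know about the start and end attributes of a chunk.

    chunk.start gives the index (token.i, counted from 0) of the first token in the chunk span.
    chunk.end gives the index one past the last token, so the last token of the chunk is at chunk.end - 1.
    For the 1-based numbering used in the question, that is chunk.start + 1 and chunk.end.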
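
A minimal sketch of how the start and end attributes could be turned into the output format from the question. The helper names chunk_token_bounds and show_noun_chunk_bounds are hypothetical, not part of spaCy; the only spaCy features assumed are spacy.load, doc.noun_chunks, and the text, start and end attributes of a span. The second function assumes the en_core_web_sm model is installed.

```python
from typing import Iterable, List, Tuple


def chunk_token_bounds(chunks: Iterable, one_based: bool = False) -> List[Tuple[str, int, int]]:
    """Return (text, first_token_index, last_token_index) for each span.

    Hypothetical helper: works with any object exposing .text, .start and
    .end the way a spaCy Span does (.start inclusive, .end exclusive).
    """
    offset = 1 if one_based else 0
    bounds = []
    for chunk in chunks:
        first = chunk.start + offset
        last = chunk.end - 1 + offset  # .end is exclusive, so step back one
        bounds.append((chunk.text, first, last))
    return bounds


def show_noun_chunk_bounds(text: str, model: str = "en_core_web_sm") -> None:
    """Print every noun chunk of `text` with 1-based first/last token numbers."""
    import spacy  # imported here so the helper above works without spaCy

    nlp = spacy.load(model)  # assumes the model has been downloaded
    doc = nlp(text)
    rows = chunk_token_bounds(doc.noun_chunks, one_based=True)
    for n, (chunk_text, first, last) in enumerate(rows, start=1):
        print(f'noun chunk {n}: "{chunk_text}"; starts {first}, ends {last}')
```

Calling show_noun_chunk_bounds("The blue car is nicer than the white car") should print one line per noun chunk in the format asked for in the question. To get the chunk back from the doc using 0-based indices, doc[chunk.start:chunk.end] works directly, because slicing a doc also treats the end as exclusive.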