I am using spaCy for an NLP project. When creating a Doc with spaCy, you can find the noun chunks in the text (also known as "noun phrases") in the following way:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp(u"The companies building cars do not want to spend more money in improving diesel engines because the government will not subsidise such engines anymore.")
for chunk in doc.noun_chunks:
    print(chunk.text)
This prints the noun phrases one by one. In this case, for instance, the first noun phrase is "The companies".
Suppose you have a text where noun chunks are referenced with a number, like:
doc = nlp(u"the Window (23) is closed because the wall (34) of the beautiful building (45) is not covered by the insurance (45)")
Assume I already have code to identify the references, for instance by tagging them:
myprocessedtext = "the Window <ref>(23)</ref> is closed because the wall <ref>(34)</ref> of the beautiful building <ref>(45)</ref> is not covered by the insurance <ref>(45)</ref>"
How could I get the noun chunks (noun phrases) immediately preceding the references?
My idea: pass the 10 words preceding every reference to a spaCy Doc object, extract the noun chunks, and take the last one. But this is highly inefficient, since creating the Doc objects is very time-consuming.
Any other idea that avoids creating extra Doc objects?
Thanks.
You can analyze the whole document and then just find the noun chunk before each reference, either by token position or by character offset. The token index of the last token in a noun chunk is noun_chunk[-1].i, and the character offset of the start of the last token is noun_chunk[-1].idx. (Check that the analysis isn't affected by the presence of the reference strings; your example's (1)-style references seem to be analyzed as appositives, which is fine.)
If the analysis is affected by the reference strings, remove them from the document while keeping track of their character offsets, analyze the whole document, and then find the noun chunks preceding the saved positions.
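If you do need to strip the references first, the offset bookkeeping might look like the sketch below (plain Python, since the spaCy side is the same noun_chunk[-1].idx lookup as above; the regex pattern for the references is an assumption based on the example):

```python
import re

text = ("the Window (23) is closed because the wall (34) of the "
        "beautiful building (45) is not covered by the insurance (45)")

# Remove each "(number)" reference (and the space before it), recording
# the character offset in the cleaned text where the reference used to sit.
parts = []
refs = []   # (char offset in cleaned text, reference number)
last = 0
for m in re.finditer(r"\s*\((\d+)\)", text):
    parts.append(text[last:m.start()])
    refs.append((sum(len(p) for p in parts), m.group(1)))
    last = m.end()
parts.append(text[last:])
cleaned_text = "".join(parts)

print(cleaned_text)
print(refs)
# To map back: run nlp(cleaned_text) once, then for each saved offset
# look up the noun chunk whose last token ends exactly there, i.e.
# chunk[-1].idx + len(chunk[-1].text) == offset.
```

Because the space before each reference is removed too, every saved offset lands exactly at the end of the preceding word, so the equality check against the chunk's end offset is sufficient.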