machine-learning nlp spacy

Can we find sentences around an entity tagged via NER?


We have a model that identifies a custom named entity. The problem is that when the whole doc is given, the model does not work as expected, but when only a few sentences are given, it gives very good results.

I want to select two sentences before and after a tagged entity.

E.g., if part of the doc contains the word Colombo (which is tagged as GPE), I need to select the two sentences before the tag and the two sentences after it. I tried a couple of approaches, but the complexity is too high.

Is there a built-in way in spaCy to address this problem?

I am using Python and spaCy.

I have tried parsing the doc by identifying the index of the tag, but that approach is really slow.


Solution

  • It might be worth checking whether you can improve the custom named entity recognizer: it is unusual for extra context to hurt performance, and fixing that issue will potentially make the model work better overall.

    However, regarding your concrete question about surrounding sentences:

    A Token or a Span (an entity is a Span) has a .sent attribute that gives you the covering sentence as a Span. This requires sentence boundaries to be set, for example by the dependency parser. If you look at the tokens right before/after a given sentence's start/end tokens, you can get the previous/next sentences for any token in a document.

    import spacy
    
    def get_previous_sentence(doc, token_index):
        # The token just before this sentence's start belongs to the previous sentence.
        start = doc[token_index].sent.start
        if start == 0:
            return None
        return doc[start - 1].sent
    
    def get_next_sentence(doc, token_index):
        # Span.end is exclusive, so doc[end] is already the next sentence's first token.
        end = doc[token_index].sent.end
        if end >= len(doc):
            return None
        return doc[end].sent
    
    nlp = spacy.load('en_core_web_lg')
    
    text = "Jane is a name. Here is a sentence. Here is another sentence. Jane was the mayor of Colombo in 2010. Here is another filler sentence. And here is yet another padding sentence without entities. Someone else is the mayor of Colombo right now."
    
    doc = nlp(text)
    
    for ent in doc.ents:
        print(ent, ent.label_, ent.sent)
        print("Prev:", get_previous_sentence(doc, ent.start))
        print("Next:", get_next_sentence(doc, ent.start))
        print("----")
    

    Output:

    Jane PERSON Jane is a name.
    Prev: None
    Next: Here is a sentence.
    ----
    Jane PERSON Jane was the mayor of Colombo in 2010.
    Prev: Here is another sentence.
    Next: Here is another filler sentence.
    ----
    Colombo GPE Jane was the mayor of Colombo in 2010.
    Prev: Here is another sentence.
    Next: Here is another filler sentence.
    ----
    2010 DATE Jane was the mayor of Colombo in 2010.
    Prev: Here is another sentence.
    Next: Here is another filler sentence.
    ----
    Colombo GPE Someone else is the mayor of Colombo right now.
    Prev: And here is yet another padding sentence without entities.
    Next: None
    ----
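
For the two-sentences-before-and-after case from the question, the same idea can be extended to a window of n sentences. The following is only a minimal sketch, not a built-in spaCy feature: `entity_windows` is a hypothetical helper name, and it assumes spaCy v3. To stay runnable without a trained model, it uses a blank English pipeline with the rule-based sentencizer and marks "Colombo" as GPE by hand; with your own pipeline, doc.ents would already be populated. The sentence list is built once per doc and each entity is located by binary search on the sentence start offsets, which should avoid rescanning the doc for every entity.

```python
import re
from bisect import bisect_right

import spacy


def entity_windows(doc, n=2):
    """Yield (entity, window) pairs, where window is a Span covering the
    entity's sentence plus up to n sentences on each side.

    Hypothetical helper for illustration; assumes doc has sentence boundaries.
    """
    sents = list(doc.sents)            # built once per doc
    starts = [s.start for s in sents]  # token index where each sentence starts
    for ent in doc.ents:
        # Index of the sentence containing the entity's first token.
        i = bisect_right(starts, ent.start) - 1
        lo = max(0, i - n)
        hi = min(len(sents) - 1, i + n)
        yield ent, doc[sents[lo].start:sents[hi].end]


# Assumption: a blank pipeline plus sentencizer stands in for the real model.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

text = (
    "Jane is a name. Here is a sentence. Here is another sentence. "
    "Jane was the mayor of Colombo in 2010. Here is another filler sentence. "
    "And here is yet another padding sentence without entities. "
    "Someone else is the mayor of Colombo right now."
)
doc = nlp(text)

# Assumption: entities are set by hand here; a trained NER would set doc.ents.
doc.ents = [
    doc.char_span(m.start(), m.end(), label="GPE")
    for m in re.finditer("Colombo", text)
]

windows = list(entity_windows(doc, n=2))
for ent, window in windows:
    print(ent.text, ent.label_)
    print(window.text)
    print("----")
```

Near the start or end of the document the window is simply clipped, so an entity in the last sentence gets only the preceding sentences. The window is returned as a single Span rather than a list of sentences so that its text can be passed straight back to the model.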