python-3.x, nlp, spacy, named-entity-recognition

Get the start and end position of found named entities


I am very new to ML and to spaCy in general. I am trying to extract the named entities from an input text.

This is my method:

import spacy
from collections import defaultdict
from pprint import pprint

def run():
    nlp = spacy.load('en_core_web_sm')
    sentence = "Hi my name is Oliver!"
    doc = nlp(sentence)

    # Threshold for the confidence scores.
    threshold = 0.2
    # Note: nlp.entity is the spaCy 2.x handle for the NER component.
    beams = nlp.entity.beam_parse(
        [doc], beam_width=16, beam_density=0.0001)

    # Sum the scores of every beam parse that contains a given entity.
    entity_scores = defaultdict(float)
    for beam in beams:
        for score, ents in nlp.entity.moves.get_beam_parses(beam):
            for start, end, label in ents:
                entity_scores[(start, end, label)] += score

    # Create a dict to store the output.
    ners = defaultdict(list)
    ners['text'] = str(sentence)

    for key in entity_scores:
        # start and end are token indices, not character offsets.
        start, end, label = key
        score = entity_scores[key]
        if (score > threshold):
            ners['extractions'].append({
                "label": str(label),
                "text": str(doc[start:end]),
                "confidence": round(score, 2)
            })

    pprint(ners)

The above method works fine, and will print something like:

defaultdict(<class 'list'>,
            {'extractions': [{'confidence': 1.0,
                              'label': 'PERSON',
                              'text': 'Oliver'}],
             'text': 'Hi my name is Oliver!'})

So far so good. Now I am trying to get the actual character position of each named entity that was found, in this case "Oliver".

The documentation lists ent.start_char and ent.end_char, but when I add them to the output dict like this:

"start_position": doc.start_char,
"end_position": doc.end_char

I get the following error:

AttributeError: 'spacy.tokens.doc.Doc' object has no attribute 'start_char'

Can someone guide me in the right direction?


Solution

  • So I actually found an answer right after posting this question (typical).

    I found that I didn't need to store the results in entity_scores; I can iterate over the entities spaCy actually found instead.

    I ended up using for ent in doc.ents: and this gives me access to all the standard spaCy attributes of each entity, including ent.start_char and ent.end_char. See below:

    ners = defaultdict(list)
    ners['text'] = str(sentence)
    for beam in beams:
        for score, ents in nlp.entity.moves.get_beam_parses(beam):
            # Note: score belongs to the whole candidate parse, not to a single entity.
            for ent in doc.ents:
                if (score > threshold):
                    ners['extractions'].append({
                        "label": str(ent.label_),
                        "text": str(ent.text),
                        "confidence": round(score, 2),
                        "start_position": ent.start_char,
                        "end_position": ent.end_char
                    })

    My entire method ends up looking like this:

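    import spacy
    from collections import defaultdict
    from pprint import pprint

    def run():
        nlp = spacy.load('en_core_web_sm')
        sentence = "Hi my name is Oliver!"
        doc = nlp(sentence)
    
        threshold = 0.2
        # Note: nlp.entity is the spaCy 2.x handle for the NER component.
        beams = nlp.entity.beam_parse(
            [doc], beam_width=16, beam_density=0.0001)
    
        ners = defaultdict(list)
        ners['text'] = str(sentence)
        for beam in beams:
            for score, ents in nlp.entity.moves.get_beam_parses(beam):
                # doc.ents holds Span objects, so the character offsets are available.
                # Note: score belongs to the whole candidate parse, not to a single entity.
                for ent in doc.ents:
                    if (score > threshold):
                        ners['extractions'].append({
                            "label": str(ent.label_),
                            "text": str(ent.text),
                            "confidence": round(score, 2),
                            "start_position": ent.start_char,
                            "end_position": ent.end_char
                        })

        pprint(ners)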
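
One caveat with the method above: the inner loop pairs every entity in doc.ents with the score of each candidate parse, so the same entity may be appended more than once when several parses pass the threshold. A possible alternative is to keep the per-entity score sum from the question and convert the token indices to character offsets afterwards. The sketch below shows that idea in plain Python so it runs without a model. The names tokens, beam_parses, char_span and extract are hypothetical, and the scores are made-up stand-ins for what get_beam_parses would yield. In spaCy itself, doc[start:end] is a Span, so doc[start:end].start_char and doc[start:end].end_char should give the same offsets.

```python
from collections import defaultdict

# Hypothetical stand-ins for spaCy's output (assumed data, not real model results).
text = "Hi my name is Oliver!"
# (token text, character offset of the token), analogous to token.text / token.idx
tokens = [("Hi", 0), ("my", 3), ("name", 6), ("is", 11), ("Oliver", 14), ("!", 20)]
# (parse score, [(start_token, end_token, label), ...]) for each candidate parse
beam_parses = [
    (0.9, [(4, 5, "PERSON")]),
    (0.1, [(4, 5, "ORG")]),
]


def char_span(tokens, start, end):
    """Map the token span [start, end) to character offsets [start_char, end_char)."""
    _, first_idx = tokens[start]
    last_text, last_idx = tokens[end - 1]
    return first_idx, last_idx + len(last_text)


def extract(text, tokens, beam_parses, threshold=0.2):
    # Sum the scores of all parses that contain the same (start, end, label).
    entity_scores = defaultdict(float)
    for score, ents in beam_parses:
        for start, end, label in ents:
            entity_scores[(start, end, label)] += score

    extractions = []
    for (start, end, label), score in entity_scores.items():
        if score > threshold:
            start_char, end_char = char_span(tokens, start, end)
            extractions.append({
                "label": label,
                "text": text[start_char:end_char],
                "confidence": round(score, 2),
                "start_position": start_char,
                "end_position": end_char,
            })
    return {"text": text, "extractions": extractions}


result = extract(text, tokens, beam_parses)
```

Each entity appears at most once here, and its confidence is the summed score of the parses that contain it rather than the score of a single parse.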