Tags: python, spacy, named-entity-recognition

spaCy DocBin Creation for NER


I am trying to create a DocBin object for custom NER. I have around 100 tagged examples for training (just as a start).

I am getting "Skipping entity" messages while creating it.

 54%|██████████████████████▊                   | 43/79 [00:00<00:00, 216.47it/s]

Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity

100%|██████████████████████████████████████████| 79/79 [00:00<00:00, 251.36it/s]

My questions are:

  1. What does this "Skipping entity" message mean (how can the span be None)?
  2. Is this a serious issue?
  3. How can this affect performance, and how can we overcome it?
  4. If 100 examples are available in total, what ratio should we use for training and evaluation?

Code

import pandas as pd
import os
from tqdm import tqdm
import spacy
from spacy.tokens import DocBin
nlp = spacy.blank("en") # create a blank English pipeline (tokenizer only)

db = DocBin() # create a DocBin object

# train_data is assumed to be a list of (text, {"entities": [(start, end, label), ...]}) pairs
for text, annot in tqdm(train_data):
    doc = nlp.make_doc(text) # tokenize the text into a Doc
    ents = []
    for start, end, label in annot["entities"]: # character offsets and label of each entity
        # returns None if no whole token lies inside text[start:end]
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is None:
            print("Skipping entity")
        else:
            ents.append(span)
    doc.ents = ents # attach the aligned entity spans to the Doc
    db.add(doc)

db.to_disk("./train.spacy") # save the DocBin object

Solution

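  • The span can be None if alignment_mode="contract" leaves no whole token inside the character range. For example, if the text contains the token "good" and you try to mark "oo" as a span with contract, char_span returns None. In the code above, each skipped entity is left out of doc.ents, so that annotation is missing from the saved training data. With alignment_mode="expand", you should always end up with at least one token.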
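
To make the difference between the alignment modes concrete, here is a minimal pure-Python sketch of how character offsets might be snapped to token boundaries. It is a simplified model, not spaCy's implementation: it assumes whitespace tokenization, and the helper names token_offsets and align_span are hypothetical, introduced only for this illustration.

```python
import re

def token_offsets(text):
    """Return (start, end) character offsets of whitespace-separated tokens."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]

def align_span(text, start, end, mode="strict"):
    """Simplified model of char-span alignment (hypothetical helper).

    Returns the aligned (start, end) character offsets, or None if no
    span can be produced in the given mode.
    """
    tokens = token_offsets(text)
    if mode == "strict":
        # Both offsets must already sit exactly on token boundaries.
        starts = {s for s, _ in tokens}
        ends = {e for _, e in tokens}
        if start < end and start in starts and end in ends:
            return (start, end)
        return None
    if mode == "contract":
        # Keep only tokens that lie completely inside [start, end).
        covered = [(s, e) for s, e in tokens if s >= start and e <= end]
    elif mode == "expand":
        # Keep every token that overlaps [start, end) at all.
        covered = [(s, e) for s, e in tokens if s < end and e > start]
    else:
        raise ValueError(f"unknown mode: {mode}")
    if not covered:
        return None
    return (covered[0][0], covered[-1][1])

text = "This is good"
# Characters 9..11 are the "oo" inside the token "good" (offsets 8..12).
for mode in ("strict", "contract", "expand"):
    print(mode, align_span(text, 9, 11, mode))
```

In this model, marking "oo" yields None for strict and contract, while expand widens the range to the whole token "good". When many entities are skipped, it is usually worth checking whether the annotated offsets are off (for example, including trailing whitespace or punctuation) rather than only switching the mode.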