I am trying to create a DocBin object for custom NER. I have around 100 tagged examples for training (just as a start).
I am getting a "Skipping entity" message while creating it.
54%|██████████████████████▊ | 43/79 [00:00<00:00, 216.47it/s]
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
100%|██████████████████████████████████████████| 79/79 [00:00<00:00, 251.36it/s]
My doubts are:
Code
import pandas as pd
import os
from tqdm import tqdm
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")  # load a new spacy model
db = DocBin()  # create a DocBin object

for text, annot in tqdm(train_data):  # data in previous format
    doc = nlp.make_doc(text)  # create doc object from text
    ents = []
    for start, end, label in annot["entities"]:  # add character indexes
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is None:
            print("Skipping entity")
        else:
            ents.append(span)
    doc.ents = ents  # label the text with the ents
    db.add(doc)

db.to_disk("./train.spacy")  # save the docbin object
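To see which annotations are being dropped, it can help to report the offending offsets and text instead of a bare message. The following is a minimal sketch, not part of the original code: the helper name find_skipped and the sample text and offsets are hypothetical, and it assumes the same (start, end, label) tuple format as the loop above.

```python
import spacy

nlp = spacy.blank("en")


def find_skipped(text, entities, alignment_mode="contract"):
    """Return (start, end, label, snippet) for every entity that
    doc.char_span cannot map onto tokens in the given mode."""
    doc = nlp.make_doc(text)
    skipped = []
    for start, end, label in entities:
        span = doc.char_span(start, end, label=label, alignment_mode=alignment_mode)
        if span is None:
            skipped.append((start, end, label, text[start:end]))
    return skipped


# Hypothetical sample: the second annotation stops one character short of
# the token "London", so it covers no complete token.
sample_text = "I like London."
sample_entities = [(7, 13, "GPE"), (7, 12, "GPE")]

skipped = find_skipped(sample_text, sample_entities)
for start, end, label, snippet in skipped:
    print(f"Skipping entity {label!r} at [{start}:{end}]: {snippet!r}")
```

Printing the character slice usually makes it obvious whether the offsets in the training data are off by a character or cut through a token.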
The span can be None if alignment_mode="contract" results in no marked tokens. So if you had a token "good" and tried to mark "oo" as a span with "contract", then it would return None. With "expand", you should always end up with at least one token.
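The "good" / "oo" case can be reproduced directly. This is a small sketch under assumed inputs: the sentence "This is good" and the label "DEMO" are made up for illustration, and only the three alignment modes of Doc.char_span are being compared.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp.make_doc("This is good")

# Characters 9..11 cover "oo", which lies strictly inside the token "good".
start, end = 9, 11

# "strict" requires the offsets to match token boundaries exactly.
strict_span = doc.char_span(start, end, label="DEMO", alignment_mode="strict")
# "contract" shrinks to the tokens fully inside the range; here there are none.
contract_span = doc.char_span(start, end, label="DEMO", alignment_mode="contract")
# "expand" grows to every token the range touches.
expand_span = doc.char_span(start, end, label="DEMO", alignment_mode="expand")

print(strict_span)    # None
print(contract_span)  # None
print(expand_span)    # good
```

Note that "expand" labels more text than was annotated, so it may be worth checking the offsets in the training data before switching modes.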