Tags: python, nlp, spacy, spacy-3

spaCy: How to initialize a Doc with entities in IOB format?


In my spaCy project, I would like to initialize a Doc object with text, labels and whitespaces. spaCy doesn't appreciate the way I provide the labels however, and shows its lack of appreciation in the following error message:

doc = Doc(nlp.vocab, words=token_texts, ents=labels, spaces=whitespaces)
  File "spacy\tokens\doc.pyx", line 297, in spacy.tokens.doc.Doc.__init__
ValueError: [E177] Ill-formed IOB input detected: ('', 'O')

The code:

import spacy
from spacy.tokens import Doc
nlp = spacy.load("en_core_web_sm")

token_texts = ["I", "like", "potatoes", "!"]
labels = [("", "O"), ("", "O"), ("food", "I"), ("", "O")]
whitespaces = [True, True, False, False]
doc = Doc(nlp.vocab, words=token_texts, ents=labels, spaces=whitespaces)

Does anyone know exactly how to serve spaCy the entities on a silver platter?

The spaCy Doc documentation states

ents: A list of strings, of the same length of words, to assign the token-based IOB tag. Defaults to None. Optional[List[str]]

The type-hint List[str] made me attempt ["", "", "food", ""], which however results in the same error message.

Stack Overflow questions that do not have the answer:

Convert NER SpaCy format to IOB format

Convert list of IOB formatted data to simple IOB formatted data

Failed to convert iob to spaCy binary format

Replace to entity tags to IOB format


Solution

  • IOB tags should be in the same format used in CoNLL files: one string per token, such as "B-PERSON", with a plain "O" for tokens outside any entity. So in your example code:

    labels = ["O", "O", "I-FOOD", "O"]
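
If the labels already exist as (label, IOB) pairs like in the question, they can be converted to CoNLL-style strings before building the Doc. The following is a minimal sketch, not part of spaCy: the helper name to_iob_strings is hypothetical, and upper-casing the label is only a convention assumed here, not something spaCy requires.

```python
from typing import List, Tuple


def to_iob_strings(pairs: List[Tuple[str, str]]) -> List[str]:
    """Turn (label, iob) pairs into CoNLL-style tags such as "B-FOOD".

    Hypothetical helper for illustration; it is not a spaCy function.
    """
    tags = []
    for label, iob in pairs:
        if iob == "O":
            # Tokens outside any entity carry no label part.
            tags.append("O")
        elif iob in ("B", "I") and label:
            # Upper-casing is an assumed convention, not a requirement.
            tags.append(f"{iob}-{label.upper()}")
        else:
            raise ValueError(f"cannot convert {(label, iob)!r}")
    return tags


# The pairs from the question, converted to one string per token.
labels = [("", "O"), ("", "O"), ("food", "I"), ("", "O")]
ents = to_iob_strings(labels)
print(ents)  # ['O', 'O', 'I-FOOD', 'O']
```

The resulting list can then be passed as the ents argument, for example Doc(nlp.vocab, words=token_texts, ents=ents, spaces=whitespaces), after which doc.ents should contain "potatoes" with the label FOOD.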