python, python-3.x, spacy, named-entity-recognition

Mapping entity IOB IDs to strings in spaCy 3.0


I have trained a simple NER pipeline using spaCy 3.0. After training, I want to get, among other things, a list of predicted IOB tags from a Doc (doc = nlp(text)), for example ["O", "O", "B", "I", "O"].

I can easily get the IOB IDs (integers) using

>>> doc.to_array("ENT_IOB")
array([2, 2, ..., 2], dtype=uint64)

But how can I get the mapping from these integer IDs to the tag strings?

I didn't find any lookup tables in doc.vocab.lookups.tables.

I also understand that I can achieve the same effect by reading ent_iob_ on each token ([token.ent_iob_ for token in doc]), but I was wondering whether there is a better way.


Solution

  • Check the Token documentation:

    • ent_iob: IOB code of named entity tag. 3 means the token begins an entity, 2 means it is outside an entity, 1 means it is inside an entity, and 0 means no entity tag is set.
    • ent_iob_: IOB code of named entity tag. "B" means the token begins an entity, "I" means it is inside an entity, "O" means it is outside an entity, and "" means no entity tag is set.

    The codes are fixed, so no lookup table is stored in the vocab. All you need is to map the IDs to the names with a simple dictionary, iob_map = {0: "", 1: "I", 2: "O", 3: "B"}:

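    import spacy

    # Assumption: any pipeline with an NER component works here;
    # en_core_web_sm stands in for your own trained pipeline.
    nlp = spacy.load("en_core_web_sm")
    doc = nlp("John went to New York in 2010.")
    print([x.text for x in doc.ents])
    # => ['John', 'New York', '2010']
    # Integer codes as documented for Token.ent_iob
    iob_map = {0: "", 1: "I", 2: "O", 3: "B"}
    print(list(map(iob_map.get, doc.to_array("ENT_IOB").tolist())))
    # => ['B', 'O', 'O', 'B', 'I', 'O', 'B', 'O']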
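
If the conversion is needed in several places, the mapping above could be wrapped in a small helper. The following is only a sketch: the names IOB_TAGS and iob_ids_to_tags are hypothetical (they are not part of spaCy), and the id list is a hand-written stand-in for doc.to_array("ENT_IOB") so that the snippet runs without a model installed.

```python
# Hypothetical helper, not part of spaCy.
# The position in the tuple is the integer code documented for Token.ent_iob:
# 0 -> "" (no tag set), 1 -> "I", 2 -> "O", 3 -> "B".
IOB_TAGS = ("", "I", "O", "B")


def iob_ids_to_tags(ids):
    """Convert an iterable of ent_iob integer codes to IOB strings."""
    # int() also accepts NumPy integer scalars such as uint64.
    return [IOB_TAGS[int(i)] for i in ids]


# Assumed id sequence, standing in for doc.to_array("ENT_IOB").
example_ids = [3, 2, 2, 3, 1, 2, 3, 2]
tags = iob_ids_to_tags(example_ids)
print(tags)
# => ['B', 'O', 'O', 'B', 'I', 'O', 'B', 'O']
```

With a real Doc, the call would presumably be iob_ids_to_tags(doc.to_array("ENT_IOB")). A tuple indexed by code is used instead of a dictionary so that an unexpected code above 3 raises an IndexError rather than silently yielding None.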