python, spacy, named-entity-recognition, spacy-3

Train two consecutive NER pipes in spaCy


I am working on a project to train a classifier to identify citations in a text. The citations we are dealing with tend to be very disorganized. Below are some example citations:

  • See Book A Chapter 3 Paragraph 7
  • See Paragraph 7 in Chapter 3 of Book A
  • See Chapter "Some Chapter Title" of Book A, Paragraph 7

We have identified a small number of entities that tend to appear in these citations, for example "book title", "chapter number", "chapter name", and "paragraph number".

The project has two stages:

  1. Binary classification of the citation in the text
  2. Classification of citation entities within the citation

Is it possible with spaCy (we're using v3) to have two consecutive NER pipes? I would want the classifier to first tag the citations and only then tag the entities within each citation.

I was able to instantiate a model with two NER pipes using the code below:

from spacy.lang.en import English

nlp = English()

# First NER pipe: tags whole citations.
ner1 = nlp.add_pipe("ner", name="ner1", last=True)
ner1.add_label("Citation")

# Second NER pipe: tags the entities inside a citation.
ner2 = nlp.add_pipe("ner", name="ner2", last=True)
for label in ["Book Title", "Chapter Number", "Chapter Name", "Paragraph Number"]:
    ner2.add_label(label)

My question is how to train each NER pipe separately. Normally, spaCy requires data in the following shape to train NER:

{
    "text": <TEXT>,
    "spans": [<LIST OF NAMED ENTITY SPANS>]
}

How can I distinguish data for each pipe in my training data?


Solution

  • There are several parts to this.

    1. You can have two NER components in one spaCy pipeline, though because of points 2 and 3 below this isn't going to work the way you want it to.
    2. Pipeline components cannot set annotations during training for downstream components to use. This is a limitation that is being worked on and should be resolved soon.
    3. NER annotations cannot be overlapping. This is a design decision and is not going to change soon. It can be worked around with a custom component but it's extra work.
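
Point 3 can be illustrated without spaCy at all. The following is a minimal sketch, assuming annotations are stored as `(start_char, end_char, label)` tuples; that layout and the helper name `find_overlaps` are hypothetical, chosen only for illustration. It reports every pair of spans that share characters. A whole-citation span always collides with the detailed entities nested inside it, which is why both layers cannot live in a single set of NER annotations.

```python
from itertools import combinations


def find_overlaps(spans):
    """Return pairs of (start, end, label) spans that share at least one character."""
    overlaps = []
    for a, b in combinations(sorted(spans), 2):
        # Half-open intervals [start, end) overlap when each starts before the other ends.
        if a[0] < b[1] and b[0] < a[1]:
            overlaps.append((a, b))
    return overlaps


text = "See Book A Chapter 3 Paragraph 7"
annotations = [
    (4, 32, "Citation"),           # "Book A Chapter 3 Paragraph 7"
    (4, 10, "Book Title"),         # "Book A"
    (19, 20, "Chapter Number"),    # "3"
    (31, 32, "Paragraph Number"),  # "7"
]

# Every conflict found here involves the outer "Citation" span.
conflicts = find_overlaps(annotations)
```

Dropping the outer "Citation" span from `annotations` leaves no conflicts, which matches the suggestion to train only on the detailed entities.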

    "I would want the classifier to first tag the citations and only then tag the entities within each citation."

    Do you actually need the whole citation tag separately or are you designing this as a two-stage process to improve performance for some reason? If it's the latter, I would just try training on the second-stage detailed annotations first and see if you actually have a problem; I'm doubtful a two-stage process would actually make things easier.

    If you actually need the whole "citation", then you can just merge chains of adjacent detailed entities into a single span; there's no need to have a separate model for that.
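
As a rough sketch of that chaining idea: assume each detailed entity is available as a `(start_token, end_token, label)` tuple, for example built from `ent.start`, `ent.end`, and `ent.label_` for each `ent` in `doc.ents`. Entities separated by at most a few tokens can then be merged into one citation range. The helper name `merge_entity_chains` and the `max_gap` threshold are hypothetical and would need tuning on real data.

```python
def merge_entity_chains(ents, max_gap=2):
    """Group entities whose token gap is <= max_gap into (start, end) citation ranges."""
    chains = []
    for start, end, _label in sorted(ents):
        if chains and start - chains[-1][1] <= max_gap:
            # Close enough to the previous entity: extend the current chain.
            chains[-1][1] = max(chains[-1][1], end)
        else:
            # First entity, or too far from the previous one: start a new chain.
            chains.append([start, end])
    return [tuple(chain) for chain in chains]


# Tokens: See(0) Paragraph(1) 7(2) in(3) Chapter(4) 3(5) of(6) Book(7) A(8) ...
ents = [
    (2, 3, "Paragraph Number"),
    (5, 6, "Chapter Number"),
    (7, 9, "Book Title"),
    (40, 42, "Book Title"),  # an unrelated mention much later in the text
]
citations = merge_entity_chains(ents)
```

With a real `Doc`, each resulting `(start, end)` pair could be turned back into a span with `doc[start:end]`. Note that a chain begins at its first entity, so lead-in words such as "See" or "Paragraph" are not included unless a rule extends the span to the left.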

    I recommend you take a good look at the section on Combining Models and Rules in the docs. It has examples like expanding personal names to include titles like Mr. or Dr., or using dependency parse info, that seem applicable to your problem.