I am working on a project to train a classifier to identify citations in a text. The citations we are dealing with tend to be very disorganized. Below are some example citations:
We have identified a small number of entities that tend to appear in these citations. For example, "book title", "chapter number", "chapter name", "paragraph number".
The project has two stages:
Is it possible with spaCy (we're using v3) to have two consecutive NER pipes? I would want the classifier to first tag the citations and only then tag the entities within each citation.
I was able to instantiate a model with two NER pipes with the code below:
from spacy.lang.en import English
nlp = English()
nlp.add_pipe("ner", name="ner1", last=True)
ner1 = nlp.get_pipe("ner1")
ner1.add_label("Citation")
nlp.add_pipe("ner", name="ner2", last=True)
ner2 = nlp.get_pipe("ner2")
for label in ["Book Title", "Chapter Number", "Chapter Name", "Paragraph Number"]:
    ner2.add_label(label)
My question is how to train each NER pipe separately. Normally, spaCy requires data in the following shape to train NER:
{
"text": <TEXT>,
"spans": [<LIST OF NAMED ENTITY SPANS>]
}
How can I distinguish data for each pipe in my training data?
There are several parts to this.
"I would want the classifier to first tag the citations and only then tag the entities within each citation."
Do you actually need the whole citation tag separately or are you designing this as a two-stage process to improve performance for some reason? If it's the latter, I would just try training on the second-stage detailed annotations first and see if you actually have a problem; I'm doubtful a two-stage process would actually make things easier.
If you actually need the whole "citation", you can just merge chains of adjacent detailed entities into a single span; there's no need for a separate model for that.
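The chain extraction described above could be sketched roughly as follows. This is only an illustration in plain Python, not a spaCy API: the function name `merge_entity_chains`, the `max_gap` parameter, and the token offsets in the example are all hypothetical. It assumes each detailed entity is available as a `(start_token, end_token, label)` tuple with an exclusive end, and that entities separated by at most `max_gap` tokens (for example a comma) belong to the same citation.

```python
from typing import List, Tuple

# (start_token, end_token_exclusive, label); hypothetical representation
Entity = Tuple[int, int, str]


def merge_entity_chains(entities: List[Entity], max_gap: int = 1) -> List[Tuple[int, int]]:
    """Group entities separated by at most `max_gap` tokens into citation spans."""
    citations: List[Tuple[int, int]] = []
    for start, end, _label in sorted(entities):
        if citations and start - citations[-1][1] <= max_gap:
            # Close enough to the previous chain: extend that chain.
            prev_start, prev_end = citations[-1]
            citations[-1] = (prev_start, max(prev_end, end))
        else:
            # Too far from the previous chain (or first entity): start a new one.
            citations.append((start, end))
    return citations


# Made-up token offsets for two citations in one text.
entities = [
    (1, 2, "Book Title"),
    (3, 5, "Chapter Number"),
    (6, 8, "Paragraph Number"),
    (12, 13, "Book Title"),
    (13, 14, "Chapter Number"),
]
print(merge_entity_chains(entities))  # [(1, 8), (12, 14)]
```

With a trained spaCy pipeline, the tuples could be built from `doc.ents` using `ent.start`, `ent.end`, and `ent.label_`, and each merged range could be turned back into a span with `doc[start:end]`. The right value for `max_gap` depends on how the citations are punctuated, so it would need tuning on real data.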
I recommend you take a good look at the section on Combining Models and Rules in the docs. It has examples like expanding personal names to include titles like Mr. or Dr., or using dependency parse info, that seem applicable to your problem.