Tags: python, nlp, spacy, named-entity-recognition

How to use a custom spaCy NER model to identify two types of docs at once


I want to build a spaCy NER model that identifies what type a doc is and applies the tags that belong to that type.

The input is in JSON format. Example:

{"text":{"a":"ABC DEF.","b":"CDE FG."},
  "annotations":[
    {"start":0,"end":3,"doc_type":"a","label":{"text":"FIRST"},"text":"ABC"}, 
    {"start":4,"end":6,"doc_type":"b","label":{"text":"SECOND"},"text":"FG"}
  ]
}

Here I want the model to recognize that the first text is of type "a", so its entities should be tagged FIRST. Similarly, the second text is of type "b", so its entities should be tagged SECOND.

How can I go about this problem? Thanks!


Solution

  • The description of your data is a little vague, but I'll work from these assumptions:

    1. You don't know in advance whether a document is type A or type B, so you need to classify it.
    2. The entity labels are completely different between type A and type B documents.

    What you should do is use (up to) three separate spaCy pipelines. Use the first pipeline with a textcat model to classify docs into A and B types, and then have one pipeline for NER for type A docs and one pipeline for type B docs. After classification just pass the text to the appropriate NER pipeline.

    This is not the most efficient possible pipeline, but it's very easy to set up - you just train three separate models and stick them together with a little glue code.
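
    Before training the two NER models you need per-type training data. Here is a minimal sketch that splits one record in the question's JSON format by doc type. The function name split_by_doc_type and the output shape (text plus a list of (start, end, label) spans) are my own choices for illustration, not a spaCy API:

```python
import json
from collections import defaultdict

# Sample record copied from the question; field names are taken from it.
RAW = """
{"text": {"a": "ABC DEF.", "b": "CDE FG."},
 "annotations": [
   {"start": 0, "end": 3, "doc_type": "a", "label": {"text": "FIRST"}, "text": "ABC"},
   {"start": 4, "end": 6, "doc_type": "b", "label": {"text": "SECOND"}, "text": "FG"}
 ]}
"""


def split_by_doc_type(record):
    """Return {doc_type: (text, [(start, end, label), ...])} for one record."""
    spans = defaultdict(list)
    for ann in record["annotations"]:
        text = record["text"][ann["doc_type"]]
        # Guard against offsets that do not match the annotated surface text.
        if text[ann["start"]:ann["end"]] != ann["text"]:
            raise ValueError(f"misaligned annotation: {ann}")
        spans[ann["doc_type"]].append(
            (ann["start"], ann["end"], ann["label"]["text"])
        )
    # Texts without any annotation are kept with an empty span list.
    return {
        doc_type: (text, sorted(spans[doc_type]))
        for doc_type, text in record["text"].items()
    }


examples = split_by_doc_type(json.loads(RAW))
print(examples["a"])  # ('ABC DEF.', [(0, 3, 'FIRST')])
print(examples["b"])  # ('CDE FG.', [(4, 6, 'SECOND')])
```

    From there, each (text, spans) pair can be turned into a Doc with doc.char_span and saved in a DocBin for training. The doc type itself doubles as the textcat label for the classifier.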

    You could also train the models separately and combine them in one spaCy pipeline, with some kind of special component to make execution of the NER conditional, but that would be pretty tricky to set up so I'd recommend the separate pipelines approach first.

    That said, depending on your problem it's possible that you don't need two NER models, and that a single model that learns the entities for both types of docs would work well. So I would also recommend you try putting all your training data together, training just one NER model, and seeing how it goes. If that works then you can have a single pipeline with textcat and NER components that don't directly interact with each other.


    To respond to the comment: when I say "pipeline" I mean a Language object, which is what spacy.load returns. You train each model from its config, each one ends up in its own directory, and then you load and combine them like this:

    import spacy
    
    # Each argument is the path to a trained pipeline directory.
    classifier = spacy.load("classifier")
    ner_a = spacy.load("ner_a")
    ner_b = spacy.load("ner_b")
    
    texts = ["I like cheese"]  # ... plus the rest of your raw texts ...
    
    for text in texts:
        # doc.cats maps each textcat label to its score.
        doc = classifier(text)
        if doc.cats["a"] > doc.cats["b"]:
            nerdoc = ner_a(text)
        else:
            nerdoc = ner_b(text)
        # ... do something with nerdoc here, e.g. read nerdoc.ents ...
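
    If you end up with more than two doc types, the if/else can be replaced by a lookup on the highest-scoring label. This is only a sketch: route is a hypothetical helper of mine, and it assumes the textcat labels are the same strings you use as keys for the NER pipelines:

```python
def route(text, classifier, ner_by_label):
    """Classify text, then run the NER pipeline registered for the top label.

    classifier: callable returning an object with a .cats dict (label -> score),
                such as a spaCy pipeline with a textcat component.
    ner_by_label: dict mapping each label to the pipeline for that doc type.
    """
    cats = classifier(text).cats
    label = max(cats, key=cats.get)  # highest-scoring doc type
    return label, ner_by_label[label](text)
```

    With the pipelines loaded as above, you would call it as label, nerdoc = route(text, classifier, {"a": ner_a, "b": ner_b}).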