How to export "Document with entities from spaCy" for use in doccano


I want to train my model with doccano or another open-source text annotation tool and continuously improve it.

My understanding is that I can import annotated data into doccano in the format described here: doccano import

As a first step, I loaded a model and created a doc:

import spacy

text = "Test text that should be annotated for Michael Schumacher"
nlp = spacy.load('en_core_web_sm')  # spaCy's small English pipeline
doc = nlp(text)

I know I can export the JSONL format (text plus annotated labels) from doccano and train a model with it. What I want to know is how to export that data from a spaCy doc in Python so that I can import it into doccano.

Thanks in advance.


Solution

  • doccano and/or spaCy seem to have changed, and the accepted answer now has some flaws. This revised version should be more correct with spaCy 3.1 and doccano as of 8/1/2021:

    def text_to_doccano(text):
        """
        :text (str): source text
        Returns (list (dict)): doccano-format records, one per sentence
        """
        djson = list()
        # nlp is the spaCy pipeline loaded beforehand with spacy.load(...)
        doc = nlp(text)
        for sent in doc.sents:
            labels = list()
            for e in sent.ents:
                # Entity offsets are document-level; shift them to be sentence-relative
                labels.append([e.start_char - sent.start_char, e.end_char - sent.start_char, e.label_])
            djson.append({'text': sent.text, "label": labels})
        return djson
    

    The differences:

    1. The JSON key is now the singular label instead of labels (?!?)
    2. e.start_char and e.end_char are actually (now?) offsets within the whole document, not within the sentence, so you have to subtract sent.start_char, the position of the sentence within the document.
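
As a rough illustration of the two points above, here is a minimal sketch that uses only the standard library. It shows the offset shift and then writes the records out as JSONL (one JSON object per line). The helper names to_sentence_offsets and dump_doccano_jsonl, the sample text, and the PERSON label are made up for the example; the hand-computed offsets stand in for what spaCy would report as e.start_char, e.end_char and sent.start_char.

```python
import json


def to_sentence_offsets(ent_start, ent_end, sent_start):
    """Shift document-level character offsets so they are relative to the sentence."""
    return ent_start - sent_start, ent_end - sent_start


def dump_doccano_jsonl(records):
    """Serialize a list of {'text': ..., 'label': ...} dicts as JSONL text."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


# Hypothetical two-sentence document (stand-in for a spaCy doc)
document = "I like tea. Michael Schumacher drove for Ferrari."
sentence = "Michael Schumacher drove for Ferrari."
entity = "Michael Schumacher"

sent_start = document.index(sentence)   # plays the role of sent.start_char
ent_start = document.index(entity)      # plays the role of e.start_char
ent_end = ent_start + len(entity)       # plays the role of e.end_char

start, end = to_sentence_offsets(ent_start, ent_end, sent_start)
records = [{"text": sentence, "label": [[start, end, "PERSON"]]}]
jsonl = dump_doccano_jsonl(records)
print(jsonl)
```

With the real function, the same dump step would be applied to the list returned by text_to_doccano(text) and the resulting string saved to a .jsonl file for import.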