I want to train my model with doccano or an other "Open source text annotation tool" and continuously improve my model.
For that my understanding is, that I can import annotated data to doccano in a format described here:
So for a first step I have loaded a model and created a doc:
text = "Test text that should be annotated for Michael Schumacher"
nlp = spacy.load('en_core_news_sm')
doc = nlp(text)
I know I can export the jsonl format (with text and annotated labels) from doccano and train a model with it but I want to know how to export that data from a spaCy doc in python so that i can import it to doccano.
Thanks in advance.
Doccano and/or spaCy seem to have changed things and there are now some flaws in the accepted answer. This revised version should be more correct with spaCy 3.1 and Doccano as of 8/1/2021...
def text_to_doccano(text):
"""
:text (str): source text
Returns (list (dict)): deccano format json
"""
djson = list()
doc = nlp(text)
for sent in doc.sents:
labels = list()
for e in sent.ents:
labels.append([e.start_char - sent.start_char, e.end_char - sent.start_char, e.label_])
djson.append({'text': sent.text, "label": labels})
return djson
The differences:
labels
becomes singular label
in the JSON (?!?)e.start_char
and e.end_char
are actually (now?) the start and end within the document, not within the sentence...so you have to offset them by the position of the sentence within the document.