Tags: nlp, spacy, named-entity-recognition, doccano

How to convert Doccano exported JSONL format to spaCy format?


I want to train a named entity recognition model on my own data set, which was exported from the annotation tool Doccano. The export is in JSONL format (not JSON), which spaCy does not accept directly as training input. How do I convert it? Here's what my dataset looks like:

{"id":17,"text":"In this work, the effect of CFs with zeolite on the mechanical, tribological prop-erties, and structure of PTFE was investigated. \nThe developed materials with a CF content of 1–5 wt.% retained their deformation and strength properties at the level of the initial polymer. \nThe compressive stress of PCM increased by 7–53%, and the yield point by 30% relative to the initial polymer. \nIt was found that with an increase in the content of fillers, the degree of crystallinity increased, and the density decreased in comparison with unfilled PTFE. \nCombining fillers (CF\/Zt) into PTFE reduced the wear rate by 810 times relative to the initial polymer. Tribochemical reactions were shown by IR spectroscopy.\nSEM established the formation of secondary structures in the form of tribofilms on the friction surface, which, together with CFs, protect the surface layer of the material from destruction during friction. \nThe wear resistance of the composite material PTFE\/CF\/Zt was effectively improved, and the coefficient of friction was low compared to PTFE\/CF\/Kl and PTFE\/CF\/Vl.","entities":[{"id":298,"label":"composite","start_offset":1049,"end_offset":1059},{"id":545,"label":"composite","start_offset":960,"end_offset":971},{"id":299,"label":"composite","start_offset":1064,"end_offset":1074},{"id":607,"label":"value","start_offset":176,"end_offset":184}],"relations":[],"Comments":[]}

I have also tried several approaches I found online, but none of them worked.


Solution

  • You can modify the script given at explosion/projects/pipelines/ner_demo/scripts/convert.py:

    import json
    import warnings
    
    import spacy
    from spacy.tokens import DocBin
    
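    def read_jsonl(fpath):
        # Yield one parsed record per non-empty line of the JSONL file.
        # The text contains non-ASCII characters, so read it as UTF-8.
        with open(fpath, "r", encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    yield json.loads(line)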
    
    
    nlp = spacy.blank("en")  # blank pipeline: only the tokenizer is needed
    doc_bin = DocBin()
    for data in read_jsonl("data.jsonl"):
        doc = nlp.make_doc(data["text"])
        ents = []
        for entity in data["entities"]:
            start = entity["start_offset"]
            end = entity["end_offset"]
            label = entity["label"]
            span = doc.char_span(
                start_idx=start,
                end_idx=end,
                label=label,
                alignment_mode="strict",
            )
            if span is None:
                msg = (
                    f"Skipping entity [{start}, {end}, {label}] in the "
                    "following text because the character span "
                    f"'{doc.text[start:end]}' does not align with token "
                    f"boundaries:\n\n{repr(doc.text)}\n"
                )
                warnings.warn(msg)
            else:
                ents.append(span)
        doc.set_ents(entities=ents)
        doc_bin.add(doc)
    
    doc_bin.to_disk("train.spacy")
    

    The data format you have given looks a bit different from the Doccano format that I'm used to, but the above should work.
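
If the script warns that spans do not align with token boundaries, it can help to first look at what each offset pair actually covers in the raw text. The following is a minimal sketch using only the standard library; the helper name `entity_surface_forms` and the one-line `sample_line` are made up for illustration and simply mirror the shape of the export in the question (`text`, `entities`, `start_offset`, `end_offset`, `label`).

```python
import json


def entity_surface_forms(record):
    """Return (label, start, end, covered_text) for each entity, sorted by start offset."""
    text = record["text"]
    entities = sorted(record["entities"], key=lambda e: e["start_offset"])
    return [
        (e["label"], e["start_offset"], e["end_offset"],
         text[e["start_offset"]:e["end_offset"]])
        for e in entities
    ]


# Hypothetical sample line in the same shape as the Doccano export above.
sample_line = (
    '{"id": 1, "text": "PTFE/CF/Zt reduced wear.", '
    '"entities": [{"id": 1, "label": "composite", '
    '"start_offset": 0, "end_offset": 10}]}'
)
record = json.loads(sample_line)

for row in entity_surface_forms(record):
    print(row)
```

Running this over each line of the real file shows whether an annotation starts or ends in the middle of a word, which is the usual reason `char_span` returns `None` with `alignment_mode="strict"`.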