Tags: python, nlp, spacy-3, conll

Convert spaCy `Doc` into CoNLL 2003 sample


I was planning to train a Spark NLP custom NER model, which uses the CoNLL 2003 format to do so (this blog even provides some training sample data to speed up the follow-up). That "sample data" is of no use to me, as I have my own training data to train a model with; this data, however, consists of a list of spaCy Doc objects, and quite honestly, I don't know how to carry out this conversion. I have found three approaches so far, each with a considerable weakness:

  1. In spaCy's documentation, I have found example code that builds the CoNLL representation of a SINGLE Doc using the spacy_conll project (see the sketch after this list), but notice it uses a blank spaCy model, so it is not clear where "my own labeled data" comes into play; furthermore, the conll_formatter component is "added at the end of the pipeline", so it seems "no direct conversion from Doc to CoNLL is actually done"... Is my understanding correct?

  2. In the Prodigy forum (another product from the designers of spaCy), I found this proposal; however, that "CoNLL" format (2003, I suppose?) seems to be incomplete: the POS tag seems to be missing (although it can easily be obtained via Token.pos_), as does the "syntactic chunk" (for which no spaCy equivalent seems to exist). These four fields are mentioned in the CoNLL 2003 official documentation.

  3. Speaking of a "direct conversion from Doc to CoNLL", I have also found this implementation based on the textacy library, but it seems this implementation was deprecated in version 0.11.0 because "CONLL-U [...] wasn't enforced or guaranteed", so I am not sure whether to use it or not (BTW, the most up-to-date textacy release at the time of writing is 0.12.0).
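For reference, this is roughly what approach 1 looks like when the conll_formatter component from spacy_conll is attached to an already-loaded (non-blank) pipeline; a minimal sketch, assuming spacy_conll is installed (and note that its output is CoNLL-U, not CoNLL 2003):

import spacy
import spacy_conll  # noqa: F401  (ensures the "conll_formatter" factory is registered)

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("conll_formatter", last=True)

doc = nlp("iPhone X is coming.")

# The component only attaches custom extensions to the Doc;
# the Doc itself is untouched, hence "no direct conversion".
print(doc._.conll_str)  # CoNLL-U string (not CoNLL 2003!)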

My current code looks like:

import spacy
from spacy.training import offsets_to_biluo_tags
from spacy.tokens import Span

print("SPACY HELPER MODEL")
base_model = "en_core_web_sm"
nlp = spacy.load(base_model)
to_disable = ['parser', 'lemmatizer', 'ner']
_ = [nlp.remove_pipe(item) for item in to_disable]
print("Base model used: ", base_model)
print("Removed components: ", to_disable)
print("Enabled components: ", nlp.pipe_names)

# Assume text is already available as sentences...
# so no need for spaCy `sentencizer` or similar
print("\nDEMO SPACY DOC LIST BUILDING...", end="")
doc1 = nlp("iPhone X is coming.")
doc1.ents = [Span(doc1, 0, 2, label="GADGET")]
doc2 = nlp("Space X is nice.")
doc2.ents = [Span(doc2, 0, 2, label="BRAND")]
docs = [doc1, doc2]
print("DONE!")

print("\nCoNLL 2003 CONVERSION:\n")
results = []
for doc in docs:
    # Preliminary: whole sentence
    whole_sentence = doc.text
    # 1st item (CoNLL 2003): word
    words = [token.text for token in doc]
    # 2nd item (CoNLL 2003): POS
    pos = [token.tag_ for token in doc]
    # 3rd item (CoNLL 2003): syntactic chunk tag
    # (no direct spaCy equivalent, hence the placeholder)
    sct = ["[UNKNOWN]" for token in doc]
    # 4th item (CoNLL 2003): named entities
    spacy_entities = [
        (ent.start_char, ent.end_char, ent.label_)
        for ent in doc.ents
    ]
    biluo_entities = offsets_to_biluo_tags(doc, spacy_entities)
    results.append((whole_sentence, words, pos, sct, biluo_entities))

for result in results:
    print(
        "\nDOC TEXT (NOT included in CoNLL 2003, just for demo): ",
        result[0], "\n"
    )
    print("-DOCSTART- -X- -X- O")
    for w, x, y, z in zip(result[1], result[2], result[3], result[4]):
        print(w, x, y, z)

# Pending: write to a file, but that's easy, and off topic here.

Which gives the following output:

DOC TEXT (NOT included in CoNLL 2003, just for demo):  iPhone X is coming.

-DOCSTART- -X- -X- O
iPhone NNP [UNKNOWN] B-GADGET
X NNP [UNKNOWN] L-GADGET
is VBZ [UNKNOWN] O
coming VBG [UNKNOWN] O
. . [UNKNOWN] O

DOC TEXT (NOT included in CoNLL 2003, just for demo):  Space X is nice.

-DOCSTART- -X- -X- O
Space NNP [UNKNOWN] B-BRAND
X NNP [UNKNOWN] L-BRAND
is VBZ [UNKNOWN] O
nice JJ [UNKNOWN] O
. . [UNKNOWN] O

Have you done something like this before?

Thanks!


Solution

  • With @AlbertoAndreotti's help, I managed to arrive at a functional workaround:

    import spacy
    from spacy.training import offsets_to_biluo_tags
    from spacy.tokens import Span
    
    print("SPACY HELPER MODEL")
    base_model = "en_core_web_sm"
    nlp = spacy.load(base_model)
    to_disable = ['parser', 'lemmatizer', 'ner']
    _ = [nlp.remove_pipe(item) for item in to_disable]
    print("Base model used: ", base_model)
    print("Removed components: ", to_disable)
    print("Enabled components: ", nlp.pipe_names)
    
    # Assume text is already available as sentences...
    # so no need for spaCy `sentencizer` or similar
    print("\nDEMO SPACY DOC LIST BUILDING...", end="")
    doc1 = nlp("iPhone X is coming.")
    doc1.ents = [Span(doc1, 0, 2, label="GADGET")]
    doc2 = nlp("Space X is nice.")
    doc2.ents = [Span(doc2, 0, 2, label="BRAND")]
    docs = [doc1, doc2]
    print("DONE!")
    
    print("\nCoNLL 2003 CONVERSION:\n")
    results = []
    for doc in docs:
        # Preliminary: whole sentence
        whole_sentence = doc.text
        # 1st item (CoNLL 2003): word
        words = [token.text for token in doc]
        # 2nd item (CoNLL 2003): POS
        pos = [token.tag_ for token in doc]
        # 3rd item (CoNLL 2003): syntactic chunk tag
        # sct: no spaCy equivalent, so the POS column is reused below
        # 4th item (CoNLL 2003): named entities
        spacy_entities = [
            (ent.start_char, ent.end_char, ent.label_)
            for ent in doc.ents
        ]
        biluo_entities = offsets_to_biluo_tags(doc, spacy_entities)
        results.append((whole_sentence, words, pos, biluo_entities))
    
    for result in results:
        print(
            "\nDOC TEXT (NOT included in CoNLL 2003, just for demo): ",
            result[0], "\n"
        )
        print("-DOCSTART- -X- -X- O")
        for w,x,y,z in zip(result[1], result[2], result[2], result[3]):
            print(w,x,y,z)
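
    One caveat worth noting: offsets_to_biluo_tags produces BILUO tags (B-, I-, L-, U-), while the NER column of the original CoNLL 2003 files uses IOB-style tags. If the consumer is strict about that, spaCy's biluo_to_iob helper converts between the two schemes; a minimal sketch:

    from spacy.training import biluo_to_iob

    # BILUO (B-, I-, L-, U-) -> IOB (B-, I-, O), the scheme used in
    # the NER column of the original CoNLL 2003 data
    biluo = ["B-GADGET", "L-GADGET", "O", "O", "O"]
    print(biluo_to_iob(biluo))  # ['B-GADGET', 'I-GADGET', 'O', 'O', 'O']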
    

    As complementary information, I found out that the 3rd missing item, the "syntactic chunk tag", is related to a broader problem called "phrase chunking", which happens to be an unsolved problem in Computer Science for which only approximations exist; so, regardless of the library used, the conversion of that 3rd item specifically into CoNLL 2003 might contain errors. However, it seems Spark NLP does not care at all about the 2nd & 3rd items, so the workaround suggested here is acceptable.
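
    To also cover the "write to a file" step left pending in the question, here is a minimal sketch under the same assumptions (the file name is a placeholder, and the POS column is duplicated into the chunk column, exactly as in the workaround above):

    # Write `results` to disk in the 4-column CoNLL 2003 layout:
    # -DOCSTART- marks document boundaries, blank lines separate sentences.
    with open("train.conll", "w", encoding="utf-8") as f:
        for _, words, pos, biluo_entities in results:
            f.write("-DOCSTART- -X- -X- O\n\n")
            for word, tag, ner in zip(words, pos, biluo_entities):
                f.write(f"{word} {tag} {tag} {ner}\n")
            f.write("\n")

    If I read Spark NLP's training documentation correctly, a file like this can then be loaded with sparknlp.training.CoNLL().readDataset(...).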

    For more details, you might want to take a look at this thread.