Converting spaCy-generated dependencies into CoNLL format cannot handle more than one ROOT?


I used the spaCy library to generate dependencies and save them in CoNLL format using the code below.

import pandas as pd
import spacy

df1 = pd.read_csv('cleantweets', encoding='latin1')
df1['tweet'] = df1['tweet'].astype(str)
tweet_list = df1['tweet'].values.tolist()
nlp = spacy.load("en_core_web_sm")
for tweet in tweet_list:
    doc = nlp(tweet)
    for sent in doc.sents:
        print('\n')
        for i, word in enumerate(sent):
            if word.head is word:
                head_idx = 0
            else:
                head_idx = doc[i].head.i + 1
            print("%d\t%s\t%d\t%s\t%s\t%s" % (
                i+1,
                word.head,
                head_idx,
                word.text,
                word.dep_,
                word.pos_,
                ))

This works, but some sentences in my dataset get split into two by spaCy because they have two ROOTs. This results in two fields for one sentence in the CoNLL format.

Example: a random sentence from my dataset is: "teanna trump probably cleaner twitter hoe but"

In CoNLL format it is saved as:

    1   trump   2   teanna      compound
    2   cleaner 4   trump       nsubj
    3   cleaner 4   probably    advmod
    4   cleaner 4   cleaner     ROOT
    5   hoe     6   twitter     amod
    6   cleaner 4   hoe         dobj


    1   but 2   but ROOT

Is there a way to save it all in one field instead of two, even though it has two ROOTs, so that 'but' becomes the 7th item in field number 1? That is, it would look like this instead:

    1   trump   2   teanna      compound
    2   cleaner 4   trump       nsubj
    3   cleaner 4   probably    advmod
    4   cleaner 4   cleaner     ROOT
    5   hoe     6   twitter     amod
    6   cleaner 4   hoe         dobj
    7   but     2   but         ROOT

Solution

  • I'd recommend using (or adapting) the textacy CoNLL exporter to get the right format; see: How to generate .conllu from a Doc object?

    spaCy's parser is doing sentence segmentation and you're iterating over doc.sents, so you'll see each sentence exported separately. If you want to provide your own sentence segmentation, you can do that with a custom component, e.g.:

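    # spaCy v2 style. In spaCy v3, register the function with
    # @Language.component("set_custom_boundaries") and pass that name to nlp.add_pipe.
    def set_custom_boundaries(doc):
        for token in doc[:-1]:
            if token.text == "...":
                # Mark the token after "..." as the start of a new sentence.
                doc[token.i+1].is_sent_start = True
        return doc

    nlp.add_pipe(set_custom_boundaries, before="parser")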

    Details (especially about how to handle None vs. False vs. True): https://spacy.io/usage/linguistic-features#sbd-custom
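
For the case in the question, where each tweet should stay in one piece, the same mechanism can be used in the other direction. The following is only a sketch under assumptions: the name keep_single_sentence is made up for illustration, and it relies on the parser respecting boundaries that were set to False before it runs, as described in the linked docs.

```python
def keep_single_sentence(doc):
    """Hypothetical component: forbid sentence starts after the first token."""
    # doc[1:] skips the first token, which always begins the first sentence.
    for token in doc[1:]:
        # False means "not a sentence start"; None would leave the decision to the parser.
        token.is_sent_start = False
    return doc

# Registration would mirror the call above (spaCy v2 style, assumes nlp is loaded):
# nlp.add_pipe(keep_single_sentence, before="parser")
```

With this in the pipeline, doc.sents should yield a single sentence per tweet. Whether the parse then has one ROOT or several is still up to the parser.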

    spaCy's default models aren't trained on Twitter-like text, so the parser probably won't perform well with respect to sentence boundaries here.

    (Please ask unrelated questions as separate questions, and also take a look at spacy's docs: https://spacy.io/usage/linguistic-features#special-cases)
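
If the goal is only to keep all tokens of a tweet in one block, another option is to skip doc.sents and iterate over the Doc itself. The sketch below is an illustration, not the textacy exporter: the helper names doc_to_conll_rows and format_conll are hypothetical, the column order follows the output shown in the question, and the function assumes it receives a full Doc (not a Span), so that token.i counts from the start of the tweet. It reads the head from token.head directly, because doc[i] with a sentence-relative i looks up a different token once a tweet contains more than one sentence.

```python
def doc_to_conll_rows(doc):
    """Return one (id, head_text, head_id, text, dep) row per token in doc.

    Ids are 1-based positions in the whole doc; a root token gets head_id 0.
    """
    rows = []
    for token in doc:
        # Compare indices rather than object identity to detect a root.
        is_root = token.head.i == token.i
        head_id = 0 if is_root else token.head.i + 1
        rows.append((token.i + 1, token.head.text, head_id, token.text, token.dep_))
    return rows


def format_conll(rows):
    """Join rows into tab-separated lines, one token per line."""
    return "\n".join("\t".join(str(field) for field in row) for row in rows)
```

Calling print(format_conll(doc_to_conll_rows(nlp(tweet)))) for each tweet would then print one block per tweet, with every ROOT marked by head id 0. Some tools that read CoNLL data expect exactly one root per sentence, so a block with two ROOTs may still need handling downstream.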