Converting pandas dataframe to CoNLL


I have a processed dataframe which is used as input to train an NLP model:

 sentence_id    words   labels
0   0            a      B-ORG
1   0            b      I-ORG
2   0            c      I-ORG
5   1            d      B-ORG
6   1            e      I-ORG
7   2            f      B-PER
8   2            g      I-PER

I need to convert this into CoNLL text format as below:

a B-ORG
b I-ORG
c I-ORG

d B-ORG
e I-ORG

f B-PER
g I-PER

The CoNLL format is a text file with one word per line, with sentences separated by an empty line. The first token on each line should be the word and the last token should be its label.

Does anyone have an idea how to do that?


Solution

  • First join both columns with a space, then loop over DataFrame.groupby and write each sentence to the file, followed by an empty line:

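    # build one "word label" string per row
    df['join'] = df['words'] + ' ' + df['labels']
    #alternative
    #df['join'] = df['words'].str.cat(df['labels'], sep=' ')
    # Series.append was removed in pandas 2.0, so write the lines directly;
    # this also avoids CSV quoting of the blank separator line
    with open('file.txt', 'w') as f:
        for i, g in df.groupby('sentence_id')['join']:
            # one sentence per group, terminated by an empty line
            f.write('\n'.join(g) + '\n\n')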
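
For reference, here is a self-contained sketch of the same idea that builds the CoNLL text as a string, so it can be checked before writing it anywhere. The helper name `to_conll` and its parameters are made up for illustration, and the sample frame just mirrors the one in the question; it assumes the words and labels are already strings.

```python
import pandas as pd


def to_conll(df, sent_col="sentence_id", word_col="words", label_col="labels"):
    """Return df as CoNLL text: "word label" lines, blank line between sentences."""
    blocks = []
    # sort=False keeps sentences in the order they first appear in the frame
    for _, group in df.groupby(sent_col, sort=False):
        lines = [f"{w} {l}" for w, l in zip(group[word_col], group[label_col])]
        blocks.append("\n".join(lines))
    # blank line between sentences, single newline at the end of the text
    return "\n\n".join(blocks) + "\n"


# sample data mirroring the question (index values are not needed here)
df = pd.DataFrame({
    "sentence_id": [0, 0, 0, 1, 1, 2, 2],
    "words": ["a", "b", "c", "d", "e", "f", "g"],
    "labels": ["B-ORG", "I-ORG", "I-ORG", "B-ORG", "I-ORG", "B-PER", "I-PER"],
})

conll_text = to_conll(df)
print(conll_text)
```

To save the result, write `conll_text` to a file opened in text mode, e.g. `open('file.txt', 'w', encoding='utf-8')`. Unlike the loop above, this version does not leave a trailing blank line after the last sentence; add one if your training tool expects it.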