How to change from CoNLL format into a sentences list?

I have a txt file in, theoretically, CoNLL format. Like this:

a O
nivel B-INDC
de O
la O
columna B-ANAT
anterior I-ANAT
del I-ANAT
acetabulo I-ANAT


existiendo O
minimos B-INDC
cambios B-INDC
edematosos B-DISO
en O
la O
medular B-ANAT
(...)

I need to convert it into a list of sentence, but I don't find a way to do it. I tried with the parser of conllu library:

from conllu import parse
sentences = parse("location/train_data.txt")

but they give the error: ParseException: Invalid line format, line must contain either tabs or two spaces.

How can I get this?

["a nivel de la columna anterior del acetabulo", "existiendo minimos cambios edematosos en la medular", ...]

Thanks

Solution

for NLP Problems, the first starting point is Huggingface - always for me - :D There is a nice example for your problem: https://huggingface.co/transformers/custom_datasets.html

Here they show a function that is exactly doing what you want:

from pathlib import Path
import re

def read_wnut(file_path):
    file_path = Path(file_path)

    raw_text = file_path.read_text().strip()
    raw_docs = re.split(r'\n\t?\n', raw_text)
    token_docs = []
    tag_docs = []
    for doc in raw_docs:
        tokens = []
        tags = []
        for line in doc.split('\n'):
            token, tag = line.split('\t')
            tokens.append(token)
            tags.append(tag)
        token_docs.append(tokens)
        tag_docs.append(tags)

    return token_docs, tag_docs

texts, tags = read_wnut("location/train_data.txt")