I have a text file that is, in theory, in CoNLL format, like this:
a O
nivel B-INDC
de O
la O
columna B-ANAT
anterior I-ANAT
del I-ANAT
acetabulo I-ANAT
existiendo O
minimos B-INDC
cambios B-INDC
edematosos B-DISO
en O
la O
medular B-ANAT
(...)
I need to convert it into a list of sentences, but I can't find a way to do it. I tried the parser from the conllu library:
from conllu import parse
sentences = parse("location/train_data.txt")
but it raises the error: ParseException: Invalid line format, line must contain either tabs or two spaces.
How can I get a list like this?
["a nivel de la columna anterior del acetabulo", "existiendo minimos cambios edematosos en la medular", ...]
Thanks
For NLP problems, my first stop is always Hugging Face :D There is a nice example for your problem: https://huggingface.co/transformers/custom_datasets.html
There they show a function that does exactly what you want:
from pathlib import Path
import re

def read_wnut(file_path):
    file_path = Path(file_path)
    raw_text = file_path.read_text().strip()
    # Documents (sentences) are separated by a blank line
    raw_docs = re.split(r'\n\t?\n', raw_text)
    token_docs = []
    tag_docs = []
    for doc in raw_docs:
        tokens = []
        tags = []
        for line in doc.split('\n'):
            # Each line is expected to be "token<TAB>tag"
            token, tag = line.split('\t')
            tokens.append(token)
            tags.append(tag)
        token_docs.append(tokens)
        tag_docs.append(tags)
    return token_docs, tag_docs

texts, tags = read_wnut("location/train_data.txt")
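Note that read_wnut returns lists of tokens, not strings, and it expects tab-separated columns. If you want the joined sentences from the question, a minimal sketch could look like the one below. It rests on two assumptions about your file that I can't verify from the excerpt: that sentences are separated by blank lines, and that the token is the first whitespace-separated column (tab or space). The name read_sentences is just something I made up for illustration.

```python
from pathlib import Path
import re


def read_sentences(file_path):
    """Return one space-joined string per blank-line-separated block."""
    raw_text = Path(file_path).read_text(encoding="utf-8").strip()
    sentences = []
    # Assumption: blocks (sentences) are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", raw_text):
        tokens = []
        for line in block.splitlines():
            parts = line.split()  # splits on tabs or spaces
            if parts:
                tokens.append(parts[0])  # assumption: first column is the token
        if tokens:
            sentences.append(" ".join(tokens))
    return sentences
```

You would call it as sentences = read_sentences("location/train_data.txt"). If the file has no blank lines between sentences, everything ends up in a single string, and you would need some other rule to find the sentence boundaries.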