
How to convert a CoNLL-format file into a list of sentences?


I have a txt file in, theoretically, CoNLL format. Like this:

a O
nivel B-INDC
de O
la O
columna B-ANAT
anterior I-ANAT
del I-ANAT
acetabulo I-ANAT


existiendo O
minimos B-INDC
cambios B-INDC
edematosos B-DISO
en O
la O
medular B-ANAT
(...)

I need to convert it into a list of sentences, but I can't find a way to do it. I tried the parser from the conllu library:

from conllu import parse
sentences = parse("location/train_data.txt")

but it raises the error: ParseException: Invalid line format, line must contain either tabs or two spaces.

How can I get a result like this?

["a nivel de la columna anterior del acetabulo", "existiendo minimos cambios edematosos en la medular", ...]

Thanks


Solution

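  • For NLP problems, my first stop is always Hugging Face :D There is a nice example for your problem: https://huggingface.co/transformers/custom_datasets.html

    There they show a function that does almost exactly what you want. It expects one tab-separated token/tag pair per line and returns the tokens and the tags of each sentence as lists, so you still need to join each token list with spaces to get sentence strings: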

    from pathlib import Path
    import re
    
    def read_wnut(file_path):
        file_path = Path(file_path)
    
        raw_text = file_path.read_text().strip()
        raw_docs = re.split(r'\n\t?\n', raw_text)
        token_docs = []
        tag_docs = []
        for doc in raw_docs:
            tokens = []
            tags = []
            for line in doc.split('\n'):
                token, tag = line.split('\t')
                tokens.append(token)
                tags.append(tag)
            token_docs.append(tokens)
            tag_docs.append(tags)
    
        return token_docs, tag_docs
    
    texts, tags = read_wnut("location/train_data.txt")
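
If your file separates token and tag with a single space, or has more than one blank line between sentences (as the sample in the question seems to), the function above may fail on `line.split('\t')`. Here is a minimal sketch of an adapted version that returns the sentence strings directly. The names `conll_to_sentences` and `read_conll_sentences` are mine, not from the Hugging Face example, and the UTF-8 encoding is an assumption about your file:

```python
import re
from pathlib import Path


def conll_to_sentences(text):
    """Turn CoNLL-style text (one 'token tag' pair per line) into sentence strings."""
    sentences = []
    # One or more blank (or whitespace-only) lines separate sentences.
    for block in re.split(r"\n\s*\n", text.strip()):
        tokens = []
        for line in block.splitlines():
            parts = line.split()  # splits on tabs or on spaces
            if parts:
                tokens.append(parts[0])  # first column is the token
        if tokens:
            sentences.append(" ".join(tokens))
    return sentences


def read_conll_sentences(file_path, encoding="utf-8"):
    """Read a file (encoding assumed to be UTF-8) and return its sentences."""
    return conll_to_sentences(Path(file_path).read_text(encoding=encoding))
```

Usage would be `sentences = read_conll_sentences("location/train_data.txt")`. As a side note, as far as I know `conllu.parse` expects the file's contents as a string rather than a path, which would explain the error in the question; even with the contents passed in, it is designed for CoNLL-U columns rather than this two-column layout.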