Tags: python-3.x, regex, text-parsing, named-entity-recognition

Replace entity tags with IOB format


I am trying to convert non-IOB named entity tags to IOB format in a CoNLL-U file.

Two sample lines of the file would be:

2 Ute Ute PROPN NE Case=Nom|Gender=Fem|Number=Sing 1 appos _ NE=PER_23|Morph=nsf

3 Wedemeier Wedemeier PROPN NE Case=Nom|Gender=Fem|Number=Sing 2 flat _ SpaceAfter=No|NE=PER_23|Morph=nsf

And I would like to have:

2 Ute Ute PROPN NE Case=Nom|Gender=Fem|Number=Sing 1 appos _ NE=B-PER|Morph=nsf

3 Wedemeier Wedemeier PROPN NE Case=Nom|Gender=Fem|Number=Sing 2 flat _ SpaceAfter=No|NE=I-PER|Morph=nsf

I want to go through the file and change every occurrence of "NE=<field>_<number>" (in the example, "NE=PER_23") to IOB format ("NE=B-PER" and "NE=I-PER"). The type itself isn't important: PER could be any entry of list_of_fields, a list I created that contains all named entity tags occurring in the file.

Since the CoNLL-U file is saved as plain text, I am parsing it line by line. Not all lines contain named entity tags, so I first check whether a tag is in the line. If so, I check whether the same tag (including the same number) is in the next line, the line after that, and so on. This is important: when the next line contains the same annotation with the same number, it belongs to the same entity. The first line of such a run must therefore get B-PER, and all following lines of the run must get I-PER.

I am trying to use fileinput so that only the NE part of each line is changed.

Hope someone can help, thanks!

    import fileinput
    import re

    list_of_fields = ["PER", "ORG", "LOC", "GPE", "OTH"]

    with fileinput.FileInput(file, inplace=True, backup=".bak") as file:
        for line in file:
            ne = [annotation for annotation in list_of_fields if (annotation in line)]
            if re.compile(r"^NE="+ne+"\_\d+$") in line:
                if re.compile(r"^NE="+ne+"\_\d+$") in next(line) == re.compile(r"^NE="+ne+"\_\d+$") in line:
                    re.sub(r"^NE="+ne+"\_\d+$", r"NE=B-"+ne, line)
                    re.sub(r"^NE="+ne+"\_\d+$", r"NE=I-"+ne, next(line))
                else:
                    re.sub(r"^NE=" + ne + "\_\d+$", r"NE=B-" + ne, line)

Solution

  • You have to remember the field and value of the last tagged line so that you can compare them across lines. If either differs from the current line's, the replacement is B-<field>; otherwise it is I-<field>:

    import fileinput
    import re

    list_of_fields = ["PER", "ORG", "LOC", "GPE", "OTH"]
    joined_fields = f'({"|".join(list_of_fields)})'
    # Capture the entity type and its numeric id, e.g. "NE=PER_23" -> ("PER", "23")
    field_pattern = re.compile(rf'NE={joined_fields}_(\d+)')
    last_field = last_value = None

    # `file` is the path to the CoNLL-U file. It is read without inplace=True:
    # in-place mode redirects standard output into the input file, so writing
    # only to output.txt would leave the input file empty.
    with fileinput.FileInput(file) as in_file, \
         open('output.txt', 'wt') as out_file:

        for line in in_file:
            match = field_pattern.search(line)
            if match is None:
                # no entity tag: copy the line unchanged
                out_file.write(line)
                continue
            field, value = match.groups()  # only the first tag on the line is used
            if field != last_field or value != last_value:
                replacement = f'B-{field}'  # a new entity starts here
            else:
                replacement = f'I-{field}'  # same type and id as the last tagged line
            last_field = field
            last_value = value
            # also consume any further tags chained with "-" (e.g. "-ORG_5")
            new_line = re.sub(rf'{field}_{value}(-{joined_fields}_\d+)*', replacement, line)
            out_file.write(new_line)
    

    EDIT: allowed for multiple fields, using only the first one
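
To check the B-/I- logic without touching any files, the same comparison can be sketched as a small generator that works on a list of lines. This is only an illustration under a few assumptions: the names `FIELDS`, `NE_PATTERN` and `to_iob` are made up for this example, only the first tag on a line is rewritten, and chained tags are not handled.

```python
import re

# Entity types taken from the question's list_of_fields
FIELDS = ["PER", "ORG", "LOC", "GPE", "OTH"]
# Matches e.g. "NE=PER_23" and captures ("PER", "23")
NE_PATTERN = re.compile(r"NE=(" + "|".join(FIELDS) + r")_(\d+)")


def to_iob(lines):
    """Yield each line with its first NE=<type>_<id> tag rewritten as NE=B-/I-<type>."""
    previous = None  # (type, id) of the last tagged line seen so far
    for line in lines:
        match = NE_PATTERN.search(line)
        if match is None:
            # no entity tag: pass the line through unchanged
            yield line
            continue
        current = match.groups()
        # same type and id as the last tagged line -> inside, otherwise -> begin
        prefix = "I" if current == previous else "B"
        previous = current
        # replace only the matched span, leaving the rest of the line untouched
        yield line[:match.start()] + f"NE={prefix}-{current[0]}" + line[match.end():]


sample = [
    "2 Ute Ute PROPN NE Case=Nom|Gender=Fem|Number=Sing 1 appos _ NE=PER_23|Morph=nsf\n",
    "3 Wedemeier Wedemeier PROPN NE Case=Nom|Gender=Fem|Number=Sing 2 flat _ SpaceAfter=No|NE=PER_23|Morph=nsf\n",
]
converted = list(to_iob(sample))
for line in converted:
    print(line, end="")
```

On the two sample lines from the question, the first comes out with NE=B-PER and the second with NE=I-PER. If the file really should be edited in place, the same generator could be fed from fileinput.FileInput(file, inplace=True, backup=".bak") with each result passed to print(..., end=""), since in-place mode redirects standard output into the input file.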