python-3.x regex text-parsing named-entity-recognition

Replace entity tags with IOB format

I am trying to convert non-IOB tags to IOB in a conllu file.

Two sample lines of the file would be:

2 Ute Ute PROPN NE Case=Nom|Gender=Fem|Number=Sing 1 appos _ NE=PER_23|Morph=nsf

3 Wedemeier Wedemeier PROPN NE Case=Nom|Gender=Fem|Number=Sing 2 flat _ SpaceAfter=No|NE=PER_23|Morph=nsf

And I would like to have

2 Ute Ute PROPN NE Case=Nom|Gender=Fem|Number=Sing 1 appos _ NE=B-PER|Morph=nsf

3 Wedemeier Wedemeier PROPN NE Case=Nom|Gender=Fem|Number=Sing 2 flat _ SpaceAfter=No|NE=I-PER|Morph=nsf

I want to go through the file and change every "NE=NamedEntityTag_Number" to IOB. The type itself isn't important: each "NE=field_number" (in the example "NE=PER_23") should become "NE=B-PER" or "NE=I-PER". PER could be any field in list_of_fields, so I created list_of_fields with all named entity tags that occur. Since the conllu file is saved as a text file, I am parsing it as plain text.

Not all lines contain named entity tags, so I first check whether a named entity tag is in the line. If so, I check whether the same tag (including the same number) is in the next line, and the line after that, etc. This is important: when the next line contains the same annotation with the same number id, it belongs to the same entity, so the first line must get B-PER, whereas the following lines of that entity must get I-PER.

I am trying to use fileinput so that only the NE part of each line changes.

Hope someone can help, thanks!

import fileinput

import re

list_of_fields = ["PER", "ORG", "LOC", "GPE", "OTH"]

with fileinput.FileInput(file, inplace=True, backup=".bak") as file:
    for line in file:
        ne = [annotation for annotation in list_of_fields if (annotation in line)]
        if re.compile(r"^NE="+ne+"\_\d+$") in line:
            if re.compile(r"^NE="+ne+"\_\d+$") in next(line) == re.compile(r"^NE="+ne+"\_\d+$") in line:
                re.sub(r"^NE="+ne+"\_\d+$", r"NE=B-"+ne, line)
                re.sub(r"^NE="+ne+"\_\d+$", r"NE=I-"+ne, next(line))
            else:
                re.sub(r"^NE=" + ne + "\_\d+$", r"NE=B-" + ne, line)

Solution

You have to remember the field and the number of the last tagged line so you can compare them across multiple lines. If either differs from the previous one, a new entity starts and you replace the tag with B-<field>; otherwise you replace it with I-<field>:

import fileinput
import re

list_of_fields = ["PER", "ORG", "LOC", "GPE", "OTH"]
joined_fields = f'({"|".join(list_of_fields)})'
# captures the field and its numeric id, e.g. ("PER", "23") for "NE=PER_23"
field_pattern = re.compile(rf'NE={joined_fields}_(\d+)')
last_field = last_value = None

# file is the path to the conllu file. The result goes to output.txt, so the
# input is only read: inplace=True would empty it, because nothing is printed.
with fileinput.FileInput(file) as in_file, \
     open('output.txt', 'wt') as out_file:

    for line in in_file:
        match = field_pattern.search(line)
        if not match:
            # keep input
            out_file.write(line)
            continue
        field, value = match.groups() # assuming only the first field per line matters
        if field != last_field or value != last_value:
            # different field or number than the previous tagged line: new entity
            replacement = f'B-{field}'
        else:
            replacement = f'I-{field}'
        last_field = field
        last_value = value
        # also swallow any further "-<field>_<number>" parts chained to the first tag
        new_line = re.sub(rf'{field}_{value}(-{joined_fields}_\d+)*', replacement, line)
        out_file.write(new_line)

EDIT: allowed for multiple fields, using only the first one
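
To try the idea without touching any file, the same comparison can be wrapped in a generator that takes any iterable of lines. This is only a minimal sketch: to_iob, FIELDS and NE_PATTERN are hypothetical names that are not part of the original code, and it assumes, as above, that only the first NE tag on a line matters and that additional tags are chained with "-".

```python
import re

# Assumption: the same tag list as in the question.
FIELDS = ["PER", "ORG", "LOC", "GPE", "OTH"]
JOINED = "|".join(FIELDS)
# Group 1 is the field, group 2 its number; chained "-<field>_<number>" parts
# are matched but not captured, so they get dropped by the replacement.
NE_PATTERN = re.compile(rf"NE=({JOINED})_(\d+)(?:-(?:{JOINED})_\d+)*")


def to_iob(lines):
    """Yield each line with its first NE=<field>_<number> rewritten as B-/I-<field>."""
    last = None  # (field, number) of the previous tagged line
    for line in lines:
        match = NE_PATTERN.search(line)
        if match is None:
            # no tag on this line: pass it through unchanged
            yield line
            continue
        current = match.groups()
        # same field and number as the previous tagged line: inside the entity
        prefix = "I" if current == last else "B"
        last = current
        yield line[:match.start()] + f"NE={prefix}-{current[0]}" + line[match.end():]


sample = [
    "2 Ute Ute PROPN NE Case=Nom|Gender=Fem|Number=Sing 1 appos _ NE=PER_23|Morph=nsf\n",
    "3 Wedemeier Wedemeier PROPN NE Case=Nom|Gender=Fem|Number=Sing 2 flat _ SpaceAfter=No|NE=PER_23|Morph=nsf\n",
]
for converted in to_iob(sample):
    print(converted, end="")
```

For the two sample lines from the question this should give NE=B-PER on the first line and NE=I-PER on the second. Since the function accepts any iterable of strings, an open file object can be passed to it directly.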