I am trying to convert non-IOB tags to IOB in a conllu file.
Two sample lines of the file would be:
2 Ute Ute PROPN NE Case=Nom|Gender=Fem|Number=Sing 1 appos _ NE=PER_23|Morph=nsf
3 Wedemeier Wedemeier PROPN NE Case=Nom|Gender=Fem|Number=Sing 2 flat _ SpaceAfter=No|NE=PER_23|Morph=nsf
And I would like to have
2 Ute Ute PROPN NE Case=Nom|Gender=Fem|Number=Sing 1 appos _ NE=B-PER|Morph=nsf
3 Wedemeier Wedemeier PROPN NE Case=Nom|Gender=Fem|Number=Sing 2 flat _ SpaceAfter=No|NE=I-PER|Morph=nsf
I now want to go through the file and change every occurrence of "NE=NamedEntityTag_Number" to IOB. The type itself isn't important: each "NE=field_type_number" (in the example, "NE=PER_23") should become "NE=B-PER" or "NE=I-PER". PER could be any field in list_of_fields, so I created list_of_fields with all named entity tags that occur. Since the conllu file is saved as a text file, I am parsing it as text. Not all lines contain named entity tags, so I first check whether a named entity tag is in the line. If so, I check whether the same tag (including the same number) is in the next line, the line after that, and so on. This is important: when the next line contains the same annotation with the same number id, it belongs to the same entity, so the first line must get B-PER and the following lines of that entity must get I-PER.
I am trying to use fileinput to change only the NE part of each line.
Hope someone can help, thanks!
`
import fileinput
import re
list_of_fields = ["PER", "ORG", "LOC", "GPE", "OTH"]
with fileinput.FileInput(file, inplace=True, backup=".bak") as file:
    for line in file:
        ne = [annotation for annotation in list_of_fields if (annotation in line)]
        if re.compile(r"^NE="+ne+"\_\d+$") in line:
            if re.compile(r"^NE="+ne+"\_\d+$") in next(line) == re.compile(r"^NE="+ne+"\_\d+$") in line:
                re.sub(r"^NE="+ne+"\_\d+$", r"NE=B-"+ne, line)
                re.sub(r"^NE="+ne+"\_\d+$", r"NE=I-"+ne, next(line))
            else:
                re.sub(r"^NE=" + ne + "\_\d+$", r"NE=B-" + ne, line)`
You have to save the last field and last value so that you can compare them across lines. If either differs from the one on the previous tagged line, you do the replacement with B-<field>, and otherwise with I-<field>:
import fileinput
import re

file = 'input.conllu'  # hypothetical path; replace with your own file
list_of_fields = ["PER", "ORG", "LOC", "GPE", "OTH"]
joined_fields = f'({"|".join(list_of_fields)})'
field_pattern = re.compile(f'NE={joined_fields}')
last_field = last_value = None

# The result is written to output.txt, so the input is only read here.
# With inplace=True the input file would end up empty, because nothing
# is printed back into it.
with fileinput.FileInput(file) as in_file, \
        open('output.txt', 'wt') as out_file:
    for line in in_file:
        matches = re.findall(field_pattern, line)
        if not matches:
            # keep input
            out_file.write(line)
            continue
        field = matches[0]  # assuming only one field per line
        start_index = line.find(f'NE={field}')
        end_index = line.find('|', start_index)
        if end_index == -1:
            # NE=... is the last entry on the line
            end_index = len(line)
        value = re.findall(rf'{field}_(\d+)', line[start_index:end_index])[0]
        # a new field or a new number starts a new entity
        if field != last_field or value != last_value:
            replacement = f'B-{field}'
        else:
            replacement = f'I-{field}'
        last_field = field
        last_value = value
        new_line = re.sub(rf'{field}_{value}(-{joined_fields}_\d+)*', replacement, line)
        out_file.write(new_line)
EDIT: allowed for multiple fields, using only the first one
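For comparison, here is a minimal sketch of the same idea as a generator that works on any iterable of lines, so it can be tried out without touching a file. The names `to_iob`, `FIELDS` and `NE_PATTERN` are mine, not from the question. It assumes the annotation always looks like `NE=<TAG>_<number>`, optionally followed by further `-<TAG>_<number>` parts that are dropped. Unlike the code above, it also forgets the remembered entity when a line has no NE tag, on the assumption that an entity only continues across directly adjacent lines.

```python
import re

# Assumption: these are all the tags that occur, as in the question.
FIELDS = ["PER", "ORG", "LOC", "GPE", "OTH"]
_ALTERNATION = "|".join(FIELDS)

# Matches e.g. "NE=PER_23" and also "NE=PER_23-ORG_5" (only the first part is used).
NE_PATTERN = re.compile(
    rf"NE=(?P<field>{_ALTERNATION})_(?P<value>\d+)(?:-(?:{_ALTERNATION})_\d+)*"
)


def to_iob(lines):
    """Yield each line with NE=<TAG>_<n> rewritten to NE=B-<TAG> or NE=I-<TAG>."""
    previous = None  # (field, value) of the directly preceding line, if tagged
    for line in lines:
        match = NE_PATTERN.search(line)
        if match is None:
            previous = None  # assumption: an untagged line ends the entity
            yield line
            continue
        current = (match.group("field"), match.group("value"))
        prefix = "I" if current == previous else "B"
        previous = current
        # Replace only the matched annotation and keep the rest of the line.
        yield line[:match.start()] + f"NE={prefix}-{current[0]}" + line[match.end():]


sample = [
    "2 Ute Ute PROPN NE Case=Nom|Gender=Fem|Number=Sing 1 appos _ NE=PER_23|Morph=nsf\n",
    "3 Wedemeier Wedemeier PROPN NE Case=Nom|Gender=Fem|Number=Sing 2 flat _ SpaceAfter=No|NE=PER_23|Morph=nsf\n",
]
for converted in to_iob(sample):
    print(converted, end="")
```

For the two sample lines from the question this prints the two desired lines (NE=B-PER, then NE=I-PER). To write the result to a file, something like `out_file.writelines(to_iob(in_file))` should work with the file handles from the code above.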