I'm building a named entity recognizer in Python and I'm struggling to get my data into the right format. I have a string and a list of the named entities in that text, each with its tag. For example:
text = "Hidden Figures is a 2016 American biographical drama film directed by Theodore Melfi and written by Melfi and Allison Schroeder."
The string can also be "[[Hidden Figures]] is a 2016 [[American]] biographical drama film directed by [[Theodore Melfi]] and written by [[Melfi]] and [[Allison Schroeder]]." if that makes it easier.
listOfNEsAndTags = ['Hidden Figures PRO', 'American LOC', 'Theodore Melfi PER', 'Melfi PER', 'Allison Schroeder PER']
What I want as output is:
Hidden PRO
Figures PRO
is O
a O
2016 O
American LOC
biographical O
drama O
film O
directed O
by O
Theodore PER
Melfi PER
and O
written O
by O
Melfi PER
and O
Allison PER
Schroeder PER
. O
So far I’ve only gotten as far as the following function:
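import re

def wordPerLine(text, neplustags):
    # Put spaces around punctuation so it splits off as its own token
    text = re.sub(r"([?!,.]+)", r" \1 ", text)
    wpl = text.split()
    output = []
    for line in wpl:
        # Every token gets the default tag O for now; neplustags is not used yet
        output.append(line + " O")
    return output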
This gives every token the default tag O (the tag for tokens that are not part of a named entity). How can I make the named entities in the text get the right tag?
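This could work as a starting point. It uses the [[...]] version of the text and keeps the tags in a dict keyed by the entity string. You will still need to replace the print calls with whatever output you want and refine the regex.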
import re

text = "[[Hidden test Figures]] is, a 2016 [[American]] biographical drama film directed by [[Theodore Melfi]] and written by [[Melfi]] and [[Allison Schroeder]]."
tags = {"Hidden test Figures": "PRO", "American": "LOC", 'Theodore Melfi': "PER", 'Melfi': "PER", 'Allison Schroeder': "PER"}
# Put a space before punctuation so it becomes its own token
text = re.sub(r"([?!,.]+)", r" \1", text)
search = ""     # words of the entity collected so far
inTag = False   # True while we are between [[ and ]]
for w in text.split(" "):
    outTag = False
    rest = w
    if rest[:2] == "[[":
        rest = rest[2:]
        inTag = True
    if rest[-2:] == "]]":
        rest = rest[:-2]
        outTag = True
    if inTag:
        search += rest
        if outTag:
            # Entity is complete: look up its tag and emit one line per word
            val = tags[search]
            for word in search.split():
                print(word + ": " + val)
            inTag = False
            search = ""
        else:
            search += " "
    else:
        print(rest + ": O")
Input:
[[Hidden test Figures]] is, a 2016 [[American]] biographical drama film directed by [[Theodore Melfi]] and written by [[Melfi]] and [[Allison Schroeder]].
Output:
Hidden: PRO
test: PRO
Figures: PRO
is: O
,: O
a: O
2016: O
American: LOC
biographical: O
drama: O
film: O
directed: O
by: O
Theodore: PER
Melfi: PER
and: O
written: O
by: O
Melfi: PER
and: O
Allison: PER
Schroeder: PER
.: O
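If you want a function that returns the lines in the "word TAG" format from the question instead of printing them, the same idea can be sketched with a single tokenizing regex. This is only a sketch under some assumptions: the names build_tag_lookup, TOKEN_RE and word_per_line are made up for illustration, the tag is assumed to be the last space-separated field of each list entry, and entities are assumed to be marked with [[...]] as above. Anything in brackets that is missing from the list falls back to O.

```python
import re

def build_tag_lookup(ne_list):
    """Turn ['Hidden Figures PRO', ...] into {'Hidden Figures': 'PRO', ...}."""
    lookup = {}
    for item in ne_list:
        # Assumption: the tag is the last space-separated field of the entry
        entity, tag = item.rsplit(" ", 1)
        lookup[entity] = tag
    return lookup

# Matches a [[...]] span (contents in group 1), a word, or one punctuation character
TOKEN_RE = re.compile(r"\[\[(.+?)\]\]|\w+|[^\w\s]")

def word_per_line(marked_text, ne_list):
    """Return one 'word TAG' string per token of the bracket-marked text."""
    lookup = build_tag_lookup(ne_list)
    lines = []
    for match in TOKEN_RE.finditer(marked_text):
        entity = match.group(1)
        if entity is not None:
            # Every word of a marked entity gets the entity's tag
            tag = lookup.get(entity, "O")
            for word in entity.split():
                lines.append(f"{word} {tag}")
        else:
            # Plain words and punctuation get the default tag
            lines.append(f"{match.group(0)} O")
    return lines
```

Calling word_per_line with the bracketed string and listOfNEsAndTags from the question, then joining the result with "\n".join(...), should give the 21 lines asked for. One limitation of this tokenization: a hyphenated word such as "well-known" is split into three tokens, so the regex may still need adjusting for your data.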