Text to word per line + named entity tag in Python


I’m building a named entity recognizer and I’m struggling to get my data into the right format in Python. I have a string and a list of the named entities in that text, each followed by its tag. For example:

text = "Hidden Figures is a 2016 American biographical drama film directed by Theodore Melfi and written by Melfi and Allison Schroeder."

This string can also be "[[Hidden Figures]] is a 2016 [[American]] biographical drama film directed by [[Theodore Melfi]] and written by [[Melfi]] and [[Allison Schroeder]]." if that makes it easier.

listOfNEsAndTags = ['Hidden Figures PRO', 'American LOC', 'Theodore Melfi PER', 'Melfi PER', 'Allison Schroeder PER']

What I want as output is:

Hidden PRO
Figures PRO
is O
a O
2016 O
American LOC
biographical O
drama O
film O
directed O
by O
Theodore PER
Melfi PER
and O
written O
by O
Melfi PER
and O 
Allison PER
Schroeder PER 
. O

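So far I only have the following function:

import re

def wordPerLine(text, neplustags):
    # Surround punctuation with spaces so it becomes its own token.
    text = re.sub(r"([?!,.]+)", r" \1 ", text)
    wpl = text.split()
    output = []
    for line in wpl:
        # neplustags is not used yet: every token gets the default tag.
        output.append(line + " O")
    return output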

This gives every word the default tag O (the tag for words that are not named entities). How can I make the named entities in the text get the right tag?


Solution

  • This could work as a starting point. You will need to replace the print calls with whatever output you want, and the regex needs refinement.

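    import re

    text = "[[Hidden test Figures]] is, a 2016 [[American]] biographical drama film directed by [[Theodore Melfi]] and written by [[Melfi]] and [[Allison Schroeder]]."

    # Map each entity, exactly as it appears between [[ and ]], to its tag.
    tags = {"Hidden test Figures": "PRO", "American": "LOC", 'Theodore Melfi': "PER", 'Melfi': "PER", 'Allison Schroeder': "PER"}

    # Put a space before punctuation so it becomes its own token.
    text = re.sub(r"([?!,.]+)", r" \1", text)

    search = ""    # words of the current entity collected so far
    inTag = False  # True while we are between [[ and ]]

    # split() without arguments also copes with repeated whitespace.
    for w in text.split():
        outTag = False  # True if this token closes the entity

        rest = w

        if rest[:2] == "[[":
            rest = rest[2:]
            inTag = True
        if rest[-2:] == "]]":
            rest = rest[:-2]
            outTag = True

        if inTag:
            search += rest
            if outTag:
                # Entity complete: look up its tag and print every word with it.
                val = tags[search]
                for word in search.split():
                    print(word + ": " + val)
                inTag = False
                search = ""
            else:
                search += " "
        else:
            print(rest + ": O")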
    

    Input:

    [[Hidden test Figures]] is, a 2016 [[American]] biographical drama film directed by [[Theodore Melfi]] and written by [[Melfi]] and [[Allison Schroeder]].
    

    Output:

    Hidden: PRO
    test: PRO
    Figures: PRO
    is: O
    ,: O
    a: O
    2016: O
    American: LOC
    biographical: O
    drama: O
    film: O
    directed: O
    by: O
    Theodore: PER
    Melfi: PER
    and: O
    written: O
    by: O
    Melfi: PER
    and: O
    Allison: PER
    Schroeder: PER
    .: O
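
The loop above prints "word: TAG" and expects a dict, while the question starts from a list of 'entity TAG' strings and wants "word TAG" lines returned. A minimal sketch of how both gaps might be closed, assuming the [[...]] form of the text is available and that every tag is a single word at the end of its entry. The names parse_entities, tag_words and TOKEN are made up for this example, and falling back to O for an entity missing from the dict is an assumption, not something the question specifies.

```python
import re

# Either a [[bracketed entity]] (group 1) or one word / punctuation mark (group 2).
TOKEN = re.compile(r"\[\[(.+?)\]\]|(\w+|[^\w\s])")


def parse_entities(entries):
    """Turn ['Hidden Figures PRO', ...] into {'Hidden Figures': 'PRO', ...}."""
    tags = {}
    for entry in entries:
        # Split at the last space only, so multi-word entities stay intact.
        entity, tag = entry.rsplit(" ", 1)
        tags[entity] = tag
    return tags


def tag_words(text, tags):
    """Return one 'word TAG' string per token; tokens outside [[...]] get 'O'."""
    lines = []
    for match in TOKEN.finditer(text):
        entity, plain = match.groups()
        if entity is not None:
            # Assumption: an entity that is missing from the dict is tagged O.
            tag = tags.get(entity, "O")
            lines.extend(f"{word} {tag}" for word in entity.split())
        else:
            lines.append(f"{plain} O")
    return lines


if __name__ == "__main__":
    entries = ['Hidden Figures PRO', 'American LOC', 'Theodore Melfi PER',
               'Melfi PER', 'Allison Schroeder PER']
    text = ("[[Hidden Figures]] is a 2016 [[American]] biographical drama film "
            "directed by [[Theodore Melfi]] and written by [[Melfi]] and "
            "[[Allison Schroeder]].")
    for line in tag_words(text, parse_entities(entries)):
        print(line)
```

Because the tokenizing is done by a single regex, punctuation is separated without a prior re.sub call. One limitation of this sketch: the \w+ pattern splits words at apostrophes and hyphens, so the pattern would need adjusting for text where those should stay together.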