Tags: python, spacy, named-entity-recognition

Looping over unique entries


I have some labeled entities with their source text, and I am trying to get them into a format spaCy can use to train an NER model. I am having trouble writing a for loop that puts entities from the same text into the same entry.

Example data: (df)

Text                              start     end     ent
Sara and Sam went to the park     0         4       Person
Sara and Sam went to the park     9         12      Person
Jake played on the swings         0         4       Person
The dog played with Tom           20        23      Person

My attempt at this is:

TRIAN = []
ENTS = []
for i in len(np.unique(df['Text'])[i]):
    text = df['Text'][i]
    for ii in range(len(df[df['Text'] == np.unique(df['Text'])[i]]]):
        Ent = [(df['start'][i + ii],[df['end'][i + ii],df['ent'][i + ii])]
        ENTS.append(Ent[i + ii])
        Results = [text[i], {'entities': ENTS.append(Ent[i + ii])}]
        TRAIN.append(Results)
print(TRAIN)

The desired output is: [["Sara and Sam went to the park", {"entities": [[0, 4, "Person"], [9, 12, "Person"]]}], ["Jake played on the swings", {"entities": [[0, 4, "Person"]]}], ["The dog played with Tom", {"entities": [[20, 23, "Person"]]}]]

Any suggestions on how to fix my code to produce the desired output would be much appreciated.


Solution

  • The way your data is formatted is a bit unusual, so it's going to be awkward to work with. You can do something like this, which relies on rows for the same text being next to each other, as in your example. (I'm going to leave out the dataframe manipulation because it's not relevant.)

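    docs = []
    ents = []
    old = None  # prior sentence
    # Assumes `data` yields (text, start, end, label) rows, with all rows
    # for the same text adjacent to each other.
    for text, start, end, label in data:
        if text != old:
            # new doc: store the previous one and reset the ent buffer
            if old is not None:
                docs.append([old, {"entities": ents}])
            ents = []
            old = text
        ents.append([start, end, label])
    # clean up after the loop (skipped when data was empty)
    if old is not None:
        docs.append([old, {"entities": ents}])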
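
If the rows for one text might not be adjacent, a dictionary keyed on the text is one alternative to tracking the previous sentence. The following is only a sketch using the example rows from the question. The helper name `build_train` is made up for illustration, and the rows are assumed to be plain `(text, start, end, label)` tuples.

```python
# Rows as (text, start, end, label) tuples, copied from the question's example.
# With the question's DataFrame, df.itertuples(index=False) should yield rows
# in this same column order (an assumption about how the data is loaded).
rows = [
    ("Sara and Sam went to the park", 0, 4, "Person"),
    ("Sara and Sam went to the park", 9, 12, "Person"),
    ("Jake played on the swings", 0, 4, "Person"),
    ("The dog played with Tom", 20, 23, "Person"),
]


def build_train(rows):
    """Group entity spans by text, keeping texts in first-seen order."""
    grouped = {}  # text -> list of [start, end, label]
    for text, start, end, label in rows:
        # setdefault creates the list the first time a text is seen
        grouped.setdefault(text, []).append([start, end, label])
    # dicts preserve insertion order, so the output follows the input order
    return [[text, {"entities": ents}] for text, ents in grouped.items()]


TRAIN = build_train(rows)
print(TRAIN)
```

Because the grouping does not depend on row order, this version should also work when the same text shows up again later in the data.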