Label custom NER in pandas dataframe


I have a dataframe with 3 columns, 'text', 'in', and 'tar', of types str, list, and list, respectively.

                   text                                       in       tar
0  This is an example text that I use in order to  ...       [2]       [6]
1  Discussion: We are examining the possibility of ...       [3]     [6, 7]

The in and tar columns represent specific entities that I want to tag in the text. Each holds the word positions of the entity terms found in the text.

For example, in the 2nd row of the dataframe, where in = [3], I want to take the 3rd word from the text column (i.e. "are") and label it as <IN>are</IN>.

Similarly, for the same row, since tar = [6, 7], I also want to take the 6th and 7th words from the text column (i.e. "possibility", "of") and label them as <TAR>possibility</TAR>, <TAR>of</TAR>.

Can someone help me do this?


Solution

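  • This is not the most efficient implementation, but it should work as a starting point. Note that it counts word positions from 0, so position 3 refers to "examining" rather than "are".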

    import pandas as pd

    data = {'text': ['This is an example text that I use in order to',
                     'Discussion: We are examining the possibility of the'],
            'in': [[2], [3]],
            'tar': [[6], [6, 7]]}
    df = pd.DataFrame(data)
    # Every column after 'text' holds the word positions for one tag.
    cols = list(df.columns)[1:]
    new_text = []
    for idx, row in df.iterrows():
        temp = row['text'].split()
        # Positions are compared 0-based here.
        for pos, word in enumerate(temp):
            for col in cols:
                if pos in row[col]:
                    temp[pos] = f'<{col.upper()}>{word}</{col.upper()}>'
        new_text.append(' '.join(temp))
    df['text'] = new_text
    print(df.text.to_list())
    

    output:

    ['This is <IN>an</IN> example text that <TAR>I</TAR> use in order to', 
     'Discussion: We are <IN>examining</IN> the possibility <TAR>of</TAR> <TAR>the</TAR>']
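
The question counts words from 1 (position 3 is "are"), whereas the loop above counts from 0. Below is a minimal sketch of a 1-based variant, assuming the same column names. The helper name `tag_row` and its `start` parameter are illustrative and not part of the original answer.

```python
import pandas as pd

def tag_row(row, cols=("in", "tar"), start=1):
    """Wrap each word whose position (counted from `start`) is listed in a column."""
    words = row["text"].split()
    for pos, word in enumerate(words, start=start):
        for col in cols:
            if pos in row[col]:
                tag = col.upper()
                # Convert the position back to a 0-based list index.
                words[pos - start] = f"<{tag}>{word}</{tag}>"
    return " ".join(words)

df = pd.DataFrame({
    "text": ["This is an example text that I use in order to",
             "Discussion: We are examining the possibility of the"],
    "in": [[2], [3]],
    "tar": [[6], [6, 7]],
})
tagged = df.apply(tag_row, axis=1).tolist()
print(tagged)
```

With `start=1`, the second row tags "are" as IN and "possibility", "of" as TAR, which is what the question asks for. Passing `start=0` should reproduce the 0-based behaviour of the loop above.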
    

    UPDATE 1

    Consecutive occurrences of the same tag can be merged as follows:

    import pandas as pd

    data = {'text': ['This is an example text that I use in order to',
                     'Discussion: We are examining the possibility of the'],
            'in': [[2], [3, 4, 5]],
            'tar': [[6], [6, 7]]}
    df = pd.DataFrame(data)
    # Every column after 'text' holds the word positions for one tag.
    cols = list(df.columns)[1:]
    new_text = []
    for idx, row in df.iterrows():
        temp = row['text'].split()
        for pos, word in enumerate(temp):
            for col in cols:
                if pos in row[col]:
                    temp[pos] = f'<{col.upper()}>{word}</{col.upper()}>'
        new_text.append(' '.join(temp))

    df['text'] = new_text
    for col in cols:
        tag = col.upper()
        # Drop a closing tag that is immediately followed by the same opening tag.
        df['text'] = df['text'].str.replace(f'</{tag}> <{tag}>', ' ', regex=False)
    print(df.text.to_list())
    

    output:

    ['This is <IN>an</IN> example text that <TAR>I</TAR> use in order to',
     'Discussion: We are <IN>examining the possibility</IN> <TAR>of the</TAR>']
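
As an alternative to tagging every word and then stripping the inner tags, the merge can be done in a single pass by grouping adjacent words that share a label. The following is only a sketch using `itertools.groupby` from the standard library. The function name `tag_and_merge` and the `labels` dictionary format are assumptions made for illustration, and positions are 0-based, as in the answer above.

```python
from itertools import groupby

def tag_and_merge(text, labels):
    """labels maps a tag name to a list of 0-based word positions."""
    words = text.split()
    # One label per word position; None where no tag applies.
    word_tags = [None] * len(words)
    for tag, positions in labels.items():
        for pos in positions:
            if 0 <= pos < len(words):  # ignore out-of-range positions
                word_tags[pos] = tag
    pieces = []
    # groupby collects runs of adjacent words that share the same label.
    for tag, group in groupby(zip(word_tags, words), key=lambda pair: pair[0]):
        chunk = " ".join(word for _, word in group)
        pieces.append(chunk if tag is None else f"<{tag}>{chunk}</{tag}>")
    return " ".join(pieces)

result = tag_and_merge("Discussion: We are examining the possibility of the",
                       {"IN": [3, 4, 5], "TAR": [6, 7]})
print(result)
```

On a dataframe this could be applied per row, for example with `df.apply(lambda r: tag_and_merge(r['text'], {'IN': r['in'], 'TAR': r['tar']}), axis=1)`. If a position appears under more than one tag, the tag listed last in `labels` wins in this sketch.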