Tags: python, spacy, huggingface-transformers, named-entity-recognition

NER using spaCy & Transformers - different result when running inside and outside of a loop


I am using NER (spaCy with a transformer model) to find and anonymize personal information. I noticed that the output I get when I pass an input line directly differs from the output I get when the same line is read from a file (see screenshot below). Does anyone have suggestions on how to fix this?

[Screenshot: displaCy output for the direct input and for the input read from the file]

Here is my code:

import pandas as pd
import csv
import spacy
from spacy import displacy
from transformers import pipeline
import re

!python -m spacy download en_core_web_trf
nlp = spacy.load('en_core_web_trf')

sent = nlp('Yesterday I went out with Andrew, johanna and Jonathan Sparow.')
displacy.render(sent, style = 'ent')

with open('Synth_dataset_raw.txt', 'r') as fd:
    reader = csv.reader(fd)
    for row in reader:
        sent = nlp(str(row))
        displacy.render(sent, style = 'ent')

Solution

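  • You are reading your file with the csv module and then converting each row to a string with str(row). csv.reader returns each row as a list of fields split on commas, so str(row) gives you the printed form of that list (brackets, quotes, and your sentence cut into pieces at every comma) instead of the original sentence. The model therefore sees different text than in your direct call, which is why the entities differ.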

    If your file just has one sentence per line, you do not need the csv module at all. You could simply do:

    with open('Synth_dataset_raw.txt', 'r') as fd:
        for line in fd:
            # Remove the trailing newline
            line = line.rstrip()
            sent = nlp(line)
            displacy.render(sent, style = 'ent')
    

    If you do in fact have a CSV file (presumably with multiple columns and a header), you could do:

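    with open('Synth_dataset_raw.txt', 'r', newline='') as fd:
        reader = csv.reader(fd)
        # Skip the header row
        header = next(reader)
        # Index of the column that holds the text (0 = first column)
        text_column_index = 0
        for row in reader:
            # Pass only the text field, not the whole row
            sent = nlp(row[text_column_index])
            displacy.render(sent, style = 'ent')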
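
To see what the original loop actually passes to nlp, here is a minimal sketch that needs neither spaCy nor the real file. It assumes the file contains the example sentence from the question on one line; io.StringIO stands in for the opened file, and the variable names are only illustrative.

```python
import csv
import io

# Assumed file content: the example sentence from the question, one per line.
line = "Yesterday I went out with Andrew, johanna and Jonathan Sparow.\n"

# What the original loop does: csv.reader splits the line on commas,
# and str() then turns the resulting list into its printed form.
row = next(csv.reader(io.StringIO(line)))
as_passed_to_nlp = str(row)

# What the plain-text loop does: strip the trailing newline, nothing else.
cleaned = line.rstrip()

print(row)
print(as_passed_to_nlp)
print(cleaned)
```

The second printed value includes the list brackets and quote characters, and the sentence is split at the comma, so the model receives different text from the direct nlp('...') call.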