python spacy huggingface-transformers named-entity-recognition

NER using spaCy & Transformers - different result when running inside and outside of a loop

I am using NER (spaCy with a transformer model) to find and anonymize personal information. I noticed that the output I get when I pass an input line directly differs from the output I get when the same line is read from a file (see screenshot below). Does anyone have suggestions on how to fix this?

Here is my code:

import pandas as pd
import csv
import spacy
from spacy import displacy
from transformers import pipeline
import re

!python -m spacy download en_core_web_trf
nlp = spacy.load('en_core_web_trf')

sent = nlp('Yesterday I went out with Andrew, johanna and Jonathan Sparow.')
displacy.render(sent, style = 'ent')

with open('Synth_dataset_raw.txt', 'r') as fd:
    reader = csv.reader(fd)
    for row in reader:
        sent = nlp(str(row))
        displacy.render(sent, style = 'ent')

Solution

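You are reading your file with the csv module and then converting each row (i.e. each line of the file) to a string with str(row). csv.reader yields every row as a list of fields split on commas, so str(row) is the printed form of that list, with brackets and quotes included, rather than the original sentence. That is why the model sees different text inside the loop than in the direct call.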
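To see the difference without loading the model, here is a minimal sketch that runs the sentence from the question through csv.reader. The io.StringIO object is only a stand-in for the file, and the variable names are made up for illustration.

```python
import csv
import io

# Stand-in for one line of the file (the sentence from the question).
line = "Yesterday I went out with Andrew, johanna and Jonathan Sparow.\n"

# csv.reader splits the line on commas and yields a list of fields.
row = next(csv.reader(io.StringIO(line)))

# This is the string that nlp(str(row)) actually receives.
as_passed_to_nlp = str(row)

print(row)
print(as_passed_to_nlp)
```

The second print shows the list's repr, so the model is given extra brackets and quote characters and a sentence that has been cut in two at the comma.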

If your file just has one sentence per line, then you do not need the csv module at all. You could simply do:

with open('Synth_dataset_raw.txt', 'r') as fd:
    for line in fd:
        # Remove the trailing newline
        line = line.rstrip()
        sent = nlp(line)
        displacy.render(sent, style = 'ent')
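If the file may contain blank lines, the loop above would pass empty strings to nlp. A small generator can filter them out first. This is only a sketch: iter_sentences is a hypothetical helper name, and io.StringIO again stands in for the real file.

```python
import io

def iter_sentences(fd):
    """Yield each non-empty line of an open text file, stripped of whitespace."""
    for line in fd:
        line = line.strip()
        if line:  # skip blank lines
            yield line

# Stand-in for the file: two sentences separated by a blank line.
sample = io.StringIO(
    "Yesterday I went out with Andrew, johanna and Jonathan Sparow.\n"
    "\n"
    "Second line.\n"
)
sentences = list(iter_sentences(sample))
print(sentences)
```

Each yielded string could then be passed to nlp in place of line.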

If you do in fact have a CSV file (presumably with multiple columns and a header), you could do:

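with open('Synth_dataset_raw.txt', 'r', newline='') as fd:
    reader = csv.reader(fd)
    header = next(reader)  # consume the header row so it is not sent to nlp
    text_column_index = 0  # index of the column that holds the text
    for row in reader:
        sent = nlp(row[text_column_index])
        displacy.render(sent, style = 'ent')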
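If the CSV has a header, selecting the column by name instead of by index can be less error-prone. The sketch below uses csv.DictReader. The column name "text", the helper iter_text_column, and the sample data are assumptions made for illustration, not something taken from your file.

```python
import csv
import io

def iter_text_column(fd, column="text"):
    """Yield the value of one named column for every data row."""
    reader = csv.DictReader(fd)  # the first row is used as the header
    for row in reader:
        yield row[column]

# Stand-in for a CSV file with a header; the quoted field keeps its comma.
sample = io.StringIO(
    'id,text\n'
    '1,"Yesterday I went out with Andrew, johanna and Jonathan Sparow."\n'
)
texts = list(iter_text_column(sample))
print(texts)
```

Because the sentence is quoted in the CSV, the comma inside it stays part of the field, and each yielded string can be passed to nlp unchanged.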