I am using NER (spaCy and Transformers) to find and anonymize personal information. I noticed that the output I get when I pass an input line directly differs from the output I get when the same line is read from a file (see screenshot below). Does anyone have suggestions on how to fix this?
Here is my code:
import pandas as pd
import csv
import spacy
from spacy import displacy
from transformers import pipeline
import re
!python -m spacy download en_core_web_trf
nlp = spacy.load('en_core_web_trf')
sent = nlp('Yesterday I went out with Andrew, johanna and Jonathan Sparow.')
displacy.render(sent, style = 'ent')
with open('Synth_dataset_raw.txt', 'r') as fd:
    reader = csv.reader(fd)
    for row in reader:
        sent = nlp(str(row))
        displacy.render(sent, style = 'ent')
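You are using the csv module to read your file and then converting each row (i.e., each line) of the file to a string with str(row). csv.reader returns each row as a list of fields split on commas, so str(row) produces the list's printed representation, brackets and quotes included, rather than the original sentence. That altered string is what the model sees, which is why the entities differ from the direct-input case.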
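To make the difference concrete, here is a small sketch that runs the example sentence from the question through csv.reader without involving spaCy at all. The io.StringIO object is only a stand-in for the real file, which I am assuming contains that sentence on a line of its own.

```python
import csv
import io

# Stand-in for one line of the file (assumed content: the sentence from the question).
line = "Yesterday I went out with Andrew, johanna and Jonathan Sparow.\n"

# csv.reader splits the line on the comma and returns a list of fields.
row = next(csv.reader(io.StringIO(line)))
print(row)

# str(row) is the list's repr, with brackets and quotes, not the sentence itself.
as_string = str(row)
print(as_string)
```

The second print shows the text that nlp() actually receives in the original loop, which is not the same string as the one passed directly.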
If your file just has one sentence per line, then you do not need the csv module at all. You could just do:
with open('Synth_dataset_raw.txt', 'r') as fd:
    for line in fd:
        # Remove the trailing newline (and any other trailing whitespace)
        line = line.rstrip()
        sent = nlp(line)
        displacy.render(sent, style = 'ent')
If you in fact have a CSV (presumably with multiple columns and a header), you could do:
with open('Synth_dataset_raw.txt', 'r') as fd:
    reader = csv.reader(fd)
    # Skip the header row
    header = next(reader)
    # Index of the column that holds the text
    text_column_index = 0
    for row in reader:
        sent = nlp(row[text_column_index])
        displacy.render(sent, style = 'ent')
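If the file does have a header, selecting the column by name rather than by index may be a little more robust. The following is only a sketch: the column names "id" and "text", the sample data, and the helper read_text_column are all made up for illustration, so adjust them to whatever your file actually contains.

```python
import csv
import io

# Stand-in for the real file; the "id" and "text" column names are assumptions.
sample = io.StringIO(
    'id,text\n'
    '1,"Yesterday I went out with Andrew, johanna and Jonathan Sparow."\n'
)

def read_text_column(fd, column="text"):
    """Return the values of one named column from an open CSV file."""
    # DictReader uses the first row as the header and yields one dict per row.
    reader = csv.DictReader(fd)
    return [row[column] for row in reader]

texts = read_text_column(sample)
print(texts)
```

Each string in texts can then be passed to nlp() in the same way as in the loops above. Note that the comma inside the sentence survives here only because the field is quoted in the file.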