I've got a list of company names that I'm trying to find in a large number of PDF documents.
I've run the PDFs through Apache Tika to extract the raw text, and I've read in the list of 200 company names.
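The Tika step can be as simple as this (a rough sketch assuming the tika Python bindings; the file name is just a placeholder):

from tika import parser

# parse one PDF and keep just the plain text; parser.from_file returns
# a dict with 'content' (the extracted text) and 'metadata'
parsed = parser.from_file("some_document.pdf")
text = parsed["content"]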
I'm stuck trying to use some combination of FuzzyWuzzy and spaCy to extract the matches I need.
This is as far as I've gotten:
import spacy
from fuzzywuzzy import fuzz, process

nlp = spacy.load("en_core_web_sm")
doc = nlp(strings[1])  # strings holds the Tika-extracted text of each PDF

companies = []    # where I want the matched company names to end up
candidates = []   # ORG entities spaCy finds in the document
for ent in doc.ents:
    if ent.label_ == "ORG":
        candidates.append(ent.text)

# this is where I'm stuck -- company_name should be one of my 200 names
process.extractBests(company_name, candidates, score_cutoff=80)
What I'm trying to do is run this over every document, match the ORG candidates against my list of 200 company names, and end up with the companies mentioned in each document.
Help!
This is the way I populated candidates -- mpg is a pandas DataFrame:
# collect every ORG entity spaCy finds in the 'name' column
for s in mpg['name'].values:
    doc = nlp(s)
    for ent in doc.ents:
        if ent.label_ == 'ORG':
            candidates.append(ent.text)
Then let's say we have a short list of car data just to test with:
candidates = ['buick',
              'buick skylark',
              'buick estate wagon',
              'buick century']
The method below uses fuzz.token_sort_ratio, which is described as "returning a measure of the sequences' similarity between 0 and 100 but sorting the token before comparing." Try out some of the other scorers partially documented here: https://github.com/seatgeek/fuzzywuzzy/issues/137
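To see what the token sorting does, compare it against plain fuzz.ratio on the same words in a different order (a quick check, not part of the method itself):

from fuzzywuzzy import fuzz

fuzz.token_sort_ratio("buick skylark", "skylark buick")  # 100 -- tokens are sorted before comparing
fuzz.ratio("buick skylark", "skylark buick")             # lower -- plain ratio is order-sensitive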
results = {}            # dictionary to store results
companies = ['buick']   # you'll have more companies
for company in companies:
    results[company] = process.extractBests(company, candidates,
                                            scorer=fuzz.token_sort_ratio,
                                            score_cutoff=50)
And the results are:
In [53]: results
Out[53]: {'buick': [('buick', 100),
                    ('buick skylark', 56),
                    ('buick century', 56)]}
In this case a cutoff score of 80 would work better than 50: it keeps the exact 'buick' match at 100 and drops the partial 'buick skylark' and 'buick century' hits.
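Putting it together with your setup, the per-document flow might look roughly like this (a sketch assuming strings holds the Tika-extracted text of each PDF and companies holds your 200 names):

import spacy
from fuzzywuzzy import fuzz, process

nlp = spacy.load("en_core_web_sm")

matches_per_doc = {}
for i, text in enumerate(strings):
    doc = nlp(text)
    # ORG entities spaCy finds in this document
    candidates = [ent.text for ent in doc.ents if ent.label_ == "ORG"]
    doc_matches = {}
    for company in companies:
        hits = process.extractBests(company, candidates,
                                    scorer=fuzz.token_sort_ratio,
                                    score_cutoff=80)
        if hits:
            doc_matches[company] = hits
    matches_per_doc[i] = doc_matches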