Tags: python, python-3.x, spacy, apache-tika, fuzzywuzzy

Fuzzy matching from string candidate list


I've got a list of company names that I am trying to parse from a large number of PDF documents.

I've forced the PDFs through Apache Tika to extract the raw text, and I've got the list of 200 companies read in.

I'm stuck trying to use some combination of FuzzyWuzzy and spaCy to extract the required matches.

This is as far as I've gotten:

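import spacy
from fuzzywuzzy import fuzz, process

nlp = spacy.load("en_core_web_sm")
# strings is assumed to hold the raw text that Tika extracted from each PDF
doc = nlp(strings[1])

companies = []
candidates = []

# Collect every entity that spaCy labels as an organisation
for ent in doc.ents:
    if ent.label_ == "ORG":
        candidates.append(ent.text)

# company_name is assumed to be one name from the list of 200 companies
process.extractBests(company_name, candidates, score_cutoff=80)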

What I'm trying to do is:

  1. Read through the document string
  2. Parse for any fuzzy company name matches scoring say 80+
  3. Return company names that are contained in the document and their scores.

Help!


Solution

  • This is how I populated candidates (mpg is a pandas DataFrame with a name column):

    candidates = []
    for s in mpg['name'].values:
        doc = nlp(s)  # nlp is the spaCy pipeline loaded in the question
        for ent in doc.ents:
            if ent.label_ == 'ORG':
                candidates.append(ent.text)
    

    Then let's say we have a short list of car data just to test with:

    candidates = ['buick'
                 ,'buick skylark'
                 ,'buick estate wagon'
                 ,'buick century']
    

    The method below uses fuzz.token_sort_ratio, which is described as "returning a measure of the sequences' similarity between 0 and 100 but sorting the token before comparing." Some of the other scorers are partially documented here: https://github.com/seatgeek/fuzzywuzzy/issues/137

    from fuzzywuzzy import fuzz, process

    results = {}  # maps each company to its list of (candidate, score) matches
    companies = ['buick']  # you'll have more companies
    for company in companies:
        results[company] = process.extractBests(company, candidates,
                                                scorer=fuzz.token_sort_ratio,
                                                score_cutoff=50)
    

    And the results are:

    In [53]: results                                                                
    Out[53]: {'buick': [('buick', 100), 
                        ('buick skylark', 56), 
                        ('buick century', 56)]}
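
For intuition about where scores like 56 come from, here is a rough standard-library approximation of a token-sort similarity. It is only a sketch and not fuzzywuzzy's actual implementation: the name token_sort_ratio_sketch is made up for this example, and the real scorer also strips punctuation before comparing.

```python
from difflib import SequenceMatcher


def token_sort_ratio_sketch(a: str, b: str) -> int:
    """Approximate a token-sort similarity score between 0 and 100."""

    def normalise(s: str) -> str:
        # Lowercase, split on whitespace, sort the tokens, and rejoin them,
        # so that word order no longer affects the comparison.
        return " ".join(sorted(s.lower().split()))

    matcher = SequenceMatcher(None, normalise(a), normalise(b))
    # ratio() is 2 * matched_chars / total_chars, a float in [0, 1].
    return round(100 * matcher.ratio())


print(token_sort_ratio_sketch("buick", "buick skylark"))
print(token_sort_ratio_sketch("skylark buick", "Buick Skylark"))
```

'buick' and 'buick skylark' share 5 matching characters out of 18 in total, so the ratio is 2 * 5 / 18, which rounds to 56. Longer candidates such as 'buick estate wagon' dilute the score further, which is why that one falls below the cutoff of 50.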
    

    For this test list, a cutoff of 80 would work better than 50: it keeps only the exact match 'buick' and drops the two partial matches that score 56.
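
To return only the companies that actually appear in a document, along with their scores, the same loop can drop companies that have no match above the cutoff. The sketch below imitates what process.extractBests does, using difflib so it runs without fuzzywuzzy installed. The helper names simple_ratio and match_companies and the sample data are made up for illustration. With fuzzywuzzy available you would call process.extractBests with a scorer as shown above.

```python
from difflib import SequenceMatcher


def simple_ratio(a, b):
    """Case-insensitive similarity from 0 to 100 (stand-in for a fuzzywuzzy scorer)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def match_companies(companies, candidates, cutoff=80, scorer=simple_ratio):
    """Map each matched company to its (candidate, score) pairs, best first."""
    unique_candidates = list(dict.fromkeys(candidates))  # dedupe, keep order
    found = {}
    for company in companies:
        hits = [(c, scorer(company, c)) for c in unique_candidates]
        hits = [(c, s) for c, s in hits if s >= cutoff]
        if hits:  # skip companies with nothing above the cutoff
            found[company] = sorted(hits, key=lambda h: h[1], reverse=True)
    return found


# Hypothetical ORG entities pulled from one document, and a short company list
candidates = ["Buick", "Buick Motor Co", "General Motors", "Ford Motor Company"]
companies = ["buick", "ford motor company", "toyota"]

matches = match_companies(companies, candidates, cutoff=80)
print(matches)
```

Here 'toyota' is left out of the result because none of the candidates reach the cutoff, so the dictionary contains only company names that were found in the document.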