
PostgreSQL Fuzzy Searching multiple words with Levenshtein


I am working on a PostgreSQL query to allow fuzzy searching for a company's name in an app that I am building. I have found and have been working with Postgres' levenshtein function (part of the fuzzystrmatch module) and for the most part it is working. However, it only seems to work when the company's name is a single word, for example:

With Apple (which is stored in the database as simply apple) I can run the following query and have it work near perfectly (it returns a levenshtein distance of 0):

SELECT * FROM contents 
  WHERE levenshtein(company_name, 'apple') < 4;
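(For anyone reproducing this: the levenshtein function is only available once the fuzzystrmatch extension has been enabled in the database, a one-time setup step:)

```sql
-- Requires appropriate privileges; run once per database.
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;
```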

However, when I take the same approach with Sony (which is stored in the database as Sony Electronics INC), I am unable to get any useful results (entering Sony gives a Levenshtein distance of 16).

I have tried to remedy this by breaking the company's name into individual words and comparing against each one individually, resulting in something like this:

user input => 'sony'

SELECT * FROM contents 
  WHERE levenshtein('Sony', 'sony') < 4 
  OR levenshtein('Electronics', 'sony') < 4 
  OR levenshtein('INC', 'sony') < 4;

So my question is this: is there some way that I can accurately implement a multi-word fuzzy search with the current general approach that I have now, or am I looking in the complete wrong place?

Thanks!


Solution

  • Given your data and the following query, with deliberately lopsided costs for the Levenshtein insertion (10000), deletion (100) and substitution (1) operations:

    with sample_data as (
        select 101 as id, 'Sony Entertainment Inc' as name
        union
        select 102 as id, 'Apple Corp' as name
    )
    select sample_data.id, sample_data.name, components.part,
           levenshtein(components.part, 'sony', 10000, 100, 1) as ld_sony
    from sample_data
    inner join (select sd.id,
                       lower(unnest(regexp_split_to_array(sd.name, E'\\s+'))) as part
                from sample_data sd) components
            on components.id = sample_data.id;
    

    The output is:

     id  |          name          |     part      | ld_sony 
    -----+------------------------+---------------+---------
     101 | Sony Entertainment Inc | sony          |       0
     101 | Sony Entertainment Inc | entertainment |     903
     101 | Sony Entertainment Inc | inc           |   10002
     102 | Apple Corp             | apple         |     104
     102 | Apple Corp             | corp          |       3
    (5 rows)
    
    • Row 1 - no changes.
    • Row 2 - 9 deletions and 3 substitutions
    • Row 3 - 1 insertion and 2 substitutions
    • Row 4 - 1 deletion and 4 substitutions
    • Row 5 - 3 substitutions
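To see where those numbers come from, the weighted edit distance can be reproduced with a short dynamic-programming sketch in Python (a simplified stand-in for Postgres' C implementation, not the real thing):

```python
# Minimal weighted Levenshtein distance, mirroring the signature of Postgres'
# levenshtein(source, target, ins_cost, del_cost, sub_cost).
def weighted_levenshtein(source, target, ins_cost=1, del_cost=1, sub_cost=1):
    m, n = len(source), len(target)
    # dp[i][j] = cheapest way to turn source[:i] into target[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = i * del_cost              # delete every source char
    for j in range(1, n + 1):
        dp[0][j] = j * ins_cost              # insert every target char
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if source[i - 1] == target[j - 1] else sub_cost
            dp[i][j] = min(dp[i - 1][j] + del_cost,     # delete from source
                           dp[i][j - 1] + ins_cost,     # insert into source
                           dp[i - 1][j - 1] + sub)      # substitute or keep
    return dp[m][n]

# Reproduces the ld_sony column above:
print(weighted_levenshtein('entertainment', 'sony', 10000, 100, 1))  # 903
print(weighted_levenshtein('inc', 'sony', 10000, 100, 1))            # 10002
```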

    I've found that splitting the words out causes a lot of false positives when you apply a threshold. Instead, you can order by the Levenshtein distance to position the better matches close to the top. Maybe tweaking the Levenshtein cost parameters will help you order the matches better. Sadly, levenshtein doesn't weight earlier changes differently than later changes.
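Putting that together against the question's contents table, an order-by version might look like this (a sketch only; it assumes contents has an id primary key and a company_name column):

```sql
-- Sketch: rank companies by their best per-word match against the input.
SELECT c.*,
       min(levenshtein(w.part, lower('sony'))) AS best_ld
FROM contents c
CROSS JOIN LATERAL
     unnest(regexp_split_to_array(lower(c.company_name), E'\\s+')) AS w(part)
GROUP BY c.id       -- valid because id is assumed to be the primary key
ORDER BY best_ld
LIMIT 10;
```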