I am working out a postgreSQL query to allow for fuzzy searching capabilities when searching for a company's name in an app that I am working on. I have found and have been working with Postgres' Levenshtein method (part of the fuzzystrmatch module) and for the most part it is working. However, it only seems to work when the company's name is one word, for example:
With Apple (which is stored in the database as simply apple) I can run the following query and have it work near perfectly (it returns a levenshtein distance of 0):
SELECT * FROM contents
WHERE levenshtein(company_name, 'apple') < 4;
However when I take the same approach with Sony (which is stored in the database as Sony Electronics INC) I am unable to get any useful results (entering Sony gives a levenshtein distance of 16).
I have tried to remedy this problem by breaking the company's name down into individual words and inputting each one individually, resulting in something like this:
user input => 'sony'
SELECT * FROM contents
WHERE levenshtein('Sony', 'sony') < 4
OR levenshtein('Electronics', 'sony') < 4
OR levenshtein('INC', 'sony') < 4;
So my question is this: is there some way that I can accurately implement a multi-word fuzzy search with the current general approach that I have now, or am I looking in the complete wrong place?
Thanks!
Given your data and the following query with wild values for the Levenshtein Insertion (10000), Deletion (100) and Substitution (1) cost:
with sample_data as (select 101 "id", 'Sony Entertainment Inc' as "name"
union
select 102 "id",'Apple Corp' as "name")
select sample_data.id,sample_data.name, components.part,
levenshtein(components.part,'sony',10000,100,1) ld_sony
from sample_data
inner join (select sd.id,
lower(unnest(regexp_split_to_array(sd.name,E'\\s+'))) part
from sample_data sd) components on components.id = sample_data.id
The output is so:
id | name | part | ld_sony
-----+------------------------+---------------+---------
101 | Sony Entertainment Inc | sony | 0
101 | Sony Entertainment Inc | entertainment | 903
101 | Sony Entertainment Inc | inc | 10002
102 | Apple Corp | apple | 104
102 | Apple Corp | corp | 3
(5 rows)
I've found that splitting the words out causes a lot of false positives whe you give a threshold. You can order by the Levenshtein distance to position the better matches close to the top. Maybe tweaking the Levenshtein variables will help you to order the matches better. Sadly, Levenshtein doesn't weight earlier changes differently than later changes.