python postgresql levenshtein-distance fuzzy-search

Postgresql: Levenshtein number for substring match

Problem: I am extracting rows that have the term 'posthole' in a long description field. These are often misspelled. I created a field using the Levenshtein function to calculate the difference between a description and the term 'posthole', but it's matching the entire string. I need to find a way to modify this to calculate the distance to the substring of the string closest to the term 'posthole'

Solution: The only thing I can think of is splitting the string into substrings on spaces and matching every substring to the search term. I just wanted to check if anyone knows of a better method for doing this.

This is pure PostgreSQL for now, but I can plug some Python code into the database if there are modules out there that deal with this issue.

Solution

You can split the string into words as rows:

with inputs (id, textcol) as (
  values (1, 'this is a test of postole and some other posthole expressions'),
         (2, 'just another posthole entry')
)
select id, word, levenshtein(upper(word), 'POSTHOLE') 
  from inputs
       cross join lateral regexp_split_to_table(textcol, '\y') r(word) 
 where length(word) > 5
   and levenshtein(upper(word), 'POSTHOLE') < 4
;

┌────┬──────────┬─────────────┐
│ id │   word   │ levenshtein │
├────┼──────────┼─────────────┤
│  1 │ postole  │           1 │
│  1 │ posthole │           0 │
│  2 │ posthole │           0 │
└────┴──────────┴─────────────┘
(3 rows)