Problem: I am extracting rows that have the term 'posthole' in a long description field. These are often misspelled. I created a field using the Levenshtein function to calculate the difference between a description and the term 'posthole', but it's matching the entire string. I need to find a way to modify this to calculate the distance to the substring of the string closest to the term 'posthole'
Solution: The only thing I can think of is splitting the string into substrings on spaces and matching every substring to the search term. I just wanted to check if anyone knows of a better method for doing this.
This is pure PostgreSQL for now, but I can plug some Python code into the database if there are modules out there that deal with this issue.
You can split the string into words as rows:
with inputs (id, textcol) as (
values (1, 'this is a test of postole and some other posthole expressions'),
(2, 'just another posthole entry')
)
select id, word, levenshtein(upper(word), 'POSTHOLE')
from inputs
cross join lateral regexp_split_to_table(textcol, '\y') r(word)
where length(word) > 5
and levenshtein(upper(word), 'POSTHOLE') < 4
;
┌────┬──────────┬─────────────┐
│ id │ word │ levenshtein │
├────┼──────────┼─────────────┤
│ 1 │ postole │ 1 │
│ 1 │ posthole │ 0 │
│ 2 │ posthole │ 0 │
└────┴──────────┴─────────────┘
(3 rows)