Search code examples
pythonpostgresqllevenshtein-distancefuzzy-search

Postgresql: Levenshtein number for substring match


Problem: I am extracting rows that have the term 'posthole' in a long description field. These are often misspelled. I created a field using the Levenshtein function to calculate the difference between a description and the term 'posthole', but it's matching the entire string. I need to find a way to modify this to calculate the distance to the substring of the string closest to the term 'posthole'

Solution: The only thing I can think of is splitting the string into substrings on spaces and matching every substring to the search term. I just wanted to check if anyone knows of a better method for doing this.

This is pure PostgreSQL for now, but I can plug some Python code into the database if there are modules out there that deal with this issue.


Solution

  • You can split the string into words as rows:

    with inputs (id, textcol) as (
      values (1, 'this is a test of postole and some other posthole expressions'),
             (2, 'just another posthole entry')
    )
    select id, word, levenshtein(upper(word), 'POSTHOLE') 
      from inputs
           cross join lateral regexp_split_to_table(textcol, '\y') r(word) 
     where length(word) > 5
       and levenshtein(upper(word), 'POSTHOLE') < 4
    ;
    
    ┌────┬──────────┬─────────────┐
    │ id │   word   │ levenshtein │
    ├────┼──────────┼─────────────┤
    │  1 │ postole  │           1 │
    │  1 │ posthole │           0 │
    │  2 │ posthole │           0 │
    └────┴──────────┴─────────────┘
    (3 rows)