Search code examples
escapingdatabricksdatabricks-sql

Databricks sql, how to compare string escaping characters


I'm analysing two columns to filter in records that don't look alike:

select firstname, name 
from tableA
where lower(firstname) != lower(name)

I accept names that match even if few special characters don't, for instance:

enter image description here

These two names to Databricks are deemed to be different (rightly so) but I want to escape that '-' charachter. Is there a way to say 'look for different names without considering the '-' character'? Or maybe a way to get a index on how much the two strings differ.


Solution

  • There are several approaches for that:

    • Do normalization of strings before comparison - lower or upper case the strings, replace all characters like -, multiple spaces, etc. with single space - you can use regexp_replace function for that. Something like this:
    select firstname, name 
    from tableA
    where 
      regex_replace(lower(firstname), '[- ]+', ' ') != 
        regex_replace(lower(name), '[- ]+', ' ')
    

    It could be further simplified by creating a user-defined SQL function for that normalization.

    select firstname, name 
    from tableA
    where levenstein(lower(firstname), lower(name)) != 0