Search code examples
sqlstringgoogle-bigquerystring-comparison

String comparison using BigQuery


I have a table with people and their hometown names, but there are same cities with different written, see:

Name Hometown
João São Paulo
Maria Sao Paul
Pedro São Paulo.
Maria S. Paulo

And I need to process this in order to formalize that data to be like this:

Name Hometown
João São Paulo
Maria São Paulo
Pedro São Paulo
Maria São Paulo
  • The dataset has more than 2400 distinct values so I can't hard code.
  • I have a Country table dimension with all cities and their correct names.

I tried this stack and would it be exactly what I need but does not work with my entire dataset.


Solution

  • Consider below approach (considering you have lookup table with all proper cities names) for purpose of example - I have it as CTE with just few ones

    with cities as (
      select 'São Paulo' as city union all 
      select 'Los Angeles' union all 
      select 'Dnipro' union all 
      select 'Kyiv'
    )
    select Name, City as Hometown
    from your_table 
    left join cities 
    on soundex(Hometown) = soundex(city)      
    

    if applied to sample data in your question - output is

    enter image description here

    Note: you obviously need to take care of potential duplication in case if some cities sounds similar, in this case adding country constraints might help ...