Search code examples
sqlpostgresqlpattern-matching

Which search strategy in SQL (postgres) to use in order to find similar strings


I have a table with records in one column which are just different by how they are written. So how can I find those and save the corresponding id's in a new table?

e.g. I have the following records in a column for cities.

Id name
1 berlin
2 ber lin
3 ber-lin
4 Berlin
5 Hamburg
6 New York
7 NewYork

So my first assumption would be to remove any special characters including white spaces, then lowercase. And see who matches and then write the id to a new table?

What would be the best and most reliable way to find machtes?


Solution

  • If removing some characters (' ' and '-' in the example) and lower-casing is enough to identify duplicates:

    CREATE TABLE tbl_folded AS
    SELECT lower(translate(name, ' -', '')) AS base_name
         , array_agg(id) AS ids
    FROM   tbl
    GROUP  BY 1;
    

    SQL Fiddle

    translate() is particularly useful to replace (or remove) a list of single characters.
    Use CREATE TABLE AS to create a new table from the results of a query.

    Related: