Search code examples
postgresqljoinsql-order-bylimit

JOIN the record with the most similar name with each row from multiple tables


Platform: PostgreSQL

Tables:

shortlist:  name (text), city (text)...
data1:      name (text), ranking (integer), score1 (double)...
data2:      name (text), ranking (integer), score1 (double)...
data3:      name (text), ranking (integer), score1 (double)...
data4:      name (text), ranking (integer), score1 (double)...

There is a limited number of data tables of similar format.


I would like to join each row in shortlist with the row in each data table that has the most similar name determined by similarity(shortlist.name, data#.name).


Pseudo code of the same idea:

for each s_row in shortlist:
    select shortlist.*
    join (SELECT data1.*, similarity(s_row.name, data1.name) AS sim FROM data1 ORDER BY sim DESC LIMIT 1)
    join (SELECT data2.*, similarity(s_row.name, data2.name) AS sim FROM data2 ORDER BY sim DESC LIMIT 1)
    join (SELECT data3.*, similarity(s_row.name, data3.name) AS sim FROM data3 ORDER BY sim DESC LIMIT 1)
    join (SELECT data4.*, similarity(s_row.name, data4.name) AS sim FROM data4 ORDER BY sim DESC LIMIT 1)

Is there a way to do this in SQL?


Solution

  • I am not entirely sure what you are after but something like this:

    select s.name, 
           d1.name as d1_name, 
           d2.name as d2_name
    from shortlist s 
      left join lateral (
        SELECT data1.*, similarity(s.name, data1.name) AS sim 
        FROM data1 
        ORDER BY sim 
        DESC LIMIT 1
      ) d1 on true
      left join lateral (
        SELECT data2.*, similarity(s.name, data2.name) AS sim 
        FROM data2 
        ORDER BY sim DESC 
        LIMIT 1
      ) d2 on true
    

    You want an outer join (left join) for each table because otherwise you would not see anything if there is no match in at least one of the tables.