Search code examples
sasfuzzy-searchproc-sql

Fuzzy match in proc sql, select minimum distance


I have a list with some fuzzy variables and a list with some clean variables. I want to want to (fuzzy) match both of them where the compged distance is lowest and some rules regarding their edit distances are satisfied (f.ex. compged < 100 and spedis < 50). I tried the following code

proc sql noprint;
create table out as
select min(compged(fuzzy.fuzzy_title,clean.cleaned_title,100)) 
as comp,
fuzzy.fuzzy_title, clean.cleaned_title
from fuzzy inner join clean
on (compged(fuzzy.fuzzy_title,clean.cleaned_title,100) < 100 and 
spedis(clean.cleaned_title,fuzzy.fuzzy_title) < 50);
quit;

The datasets fuzzy and clean basically just contain the titles that I want to match. The code I use just gives me the minimum compged score of the whole dataset and then some arbitrary match where my condition regarding the distances are satisfied. Is there a way to choose exactly the clean_title with the minimum compged score for a given fuzzy_title? I might have searched wrong, but I couldn't find the answer to this.


Solution

  • I think you are looking for group by + having:

       proc sql;
        create table out as
        select 
             compged(fuzzy.fuzzy_title,clean.cleaned_title,100)as comp
            ,fuzzy.fuzzy_title
            ,clean.cleaned_title
        from fuzzy inner join clean
            on (compged(fuzzy.fuzzy_title,clean.cleaned_title,100) < 100 
            and spedis(clean.cleaned_title,fuzzy.fuzzy_title) < 50)
        group by fuzzy.fuzzy_title
        having calculated comp = min(compged(fuzzy.fuzzy_title,clean.cleaned_title,100))
        ;quit;
    

    If there are more fuzzy_title + cleaned_title pairs having the same comp value, all of them will be in the output. You can select only one of them in a single query. However, I think it's easier to keep those steps separated and select one row for each fuzzy_title in another query (e.g. using first data step variable).