I have a list with some fuzzy variables and a list with some clean variables. I want to want to (fuzzy) match both of them where the compged distance is lowest and some rules regarding their edit distances are satisfied (f.ex. compged < 100 and spedis < 50). I tried the following code
proc sql noprint;
create table out as
select min(compged(fuzzy.fuzzy_title,clean.cleaned_title,100))
as comp,
fuzzy.fuzzy_title, clean.cleaned_title
from fuzzy inner join clean
on (compged(fuzzy.fuzzy_title,clean.cleaned_title,100) < 100 and
spedis(clean.cleaned_title,fuzzy.fuzzy_title) < 50);
quit;
The datasets fuzzy and clean basically just contain the titles that I want to match. The code I use just gives me the minimum compged score of the whole dataset and then some arbitrary match where my condition regarding the distances are satisfied. Is there a way to choose exactly the clean_title with the minimum compged score for a given fuzzy_title? I might have searched wrong, but I couldn't find the answer to this.
I think you are looking for group by
+ having
:
proc sql;
create table out as
select
compged(fuzzy.fuzzy_title,clean.cleaned_title,100)as comp
,fuzzy.fuzzy_title
,clean.cleaned_title
from fuzzy inner join clean
on (compged(fuzzy.fuzzy_title,clean.cleaned_title,100) < 100
and spedis(clean.cleaned_title,fuzzy.fuzzy_title) < 50)
group by fuzzy.fuzzy_title
having calculated comp = min(compged(fuzzy.fuzzy_title,clean.cleaned_title,100))
;quit;
If there are more fuzzy_title
+ cleaned_title
pairs having the same comp
value, all of them will be in the output. You can select only one of them in a single query. However, I think it's easier to keep those steps separated and select one row for each fuzzy_title
in another query (e.g. using first
data step variable).