Search code examples
postgresqlsimilaritylevenshtein-distancefuzzytrigram

plpgsql function calling trigram similarity function inside does not utilize GIN or GIST indexes


I wanted to combine PostgreSQL Levenshtein and trigram similarity functions. The main advantage of the trigram similarity function is that it can utilize GIN or GIST indexes and thus can return fuzzy match results quickly. However, if it is called inside another function, it does not use indexes. For sake of this problem illustration, here is a plpgsql function "trigram_similarity" that calls original trigram's "similarity" function.

CREATE OR REPLACE FUNCTION public.trigram_similarity(
    left_string text,
    right_string text)
  RETURNS real AS
$BODY$
BEGIN
  RETURN similarity(left_string, right_string);
END;$BODY$
  LANGUAGE plpgsql IMMUTABLE STRICT
  COST 100;
ALTER FUNCTION public.trigram_similarity(text, text)
  OWNER TO postgres;

Although the function actually just calls the trigram's similarity function, it behaves completely different when it comes to GIN indexes utilization. While the original trigram's similarity function inside WHERE clause of a query does utilize GIN indexes and thus a query returns result quickly and without much RAM consumption, when using trigram_similarity it does not. For large datasets fuzzy match analysis, the RAM is completely used and application freezes... For sake of illustration, here is an example query:

SELECT DISTINCT  
trigram_similarity(l.l_composite_18, r.r_composite_18)
::numeric(5,4) AS trigram_similarity_composite_score 
, (trigram_similarity(l."Name", r."Name")*(0.166666666666667) 
+ trigram_similarity(l."LastName", r."Surname")*(0.0833333333333333) 
+ trigram_similarity(l."County", r."District")*(0.0416666666666667) 
+ trigram_similarity(l."Town", r."Location")*(0.0416666666666667) 
+ trigram_similarity(l."PostalCode", r."ZipCode")*(0.0416666666666667) 
+ trigram_similarity(l."PostOffice", r."PostOffice")*(0.0416666666666667) 
+ trigram_similarity(l."Street", r."Road")*(0.0416666666666667) 
+ trigram_similarity(l."Number", r."HomeNumber")*(0.0416666666666667) 
+ trigram_similarity(l."Telephone1", r."Phone1")*(0.0416666666666667) 
+ trigram_similarity(l."Telephone2", r."Phone2")*(0.0416666666666667) 
+ trigram_similarity(l."EMail", r."EMail")*(0.0416666666666667) 
+ trigram_similarity(l."BirthDate", r."DateOfBirth")*(0.166666666666667) 
+ trigram_similarity(l."Gender", r."Sex")*(0.208333333333333) 
)
::numeric(5,4) AS trigram_similarity_weighted_score  
, l."ClanID" AS "l_ClanID_1"
, l."Name" AS "l_Name_2"
, l."LastName" AS "l_LastName_3"
, l."County" AS "l_County_4"
, l."Town" AS "l_Town_5"
, l."PostalCode" AS "l_PostalCode_6"
, l."PostOffice" AS "l_PostOffice_7"
, l."Street" AS "l_Street_8"
, l."Number" AS "l_Number_9"
, l."Telephone1" AS "l_Telephone1_10"
, l."Telephone2" AS "l_Telephone2_11"
, l."EMail" AS "l_EMail_12"
, l."BirthDate" AS "l_BirthDate_13"
, l."Gender" AS "l_Gender_14"
, l."Aktivan" AS "l_Aktivan_15"
, l."ProgramCode" AS "l_ProgramCode_16"
, l."Card" AS "l_Card_17"
, l."DateOfCreation" AS "l_DateOfCreation_18"
, l."Assigned" AS "l_Assigned_19"
, l."Reserved" AS "l_Reserved_20"
, l."Sent" AS "l_Sent_21"
, l."MemberOfBothPrograms" AS "l_MemberOfBothPrograms_22"
, r."ClanID" AS "r_ClanID_23"
, r."Name" AS "r_Name_24"
, r."Surname" AS "r_Surname_25"
, r."District" AS "r_District_26"
, r."Location" AS "r_Location_27"
, r."ZipCode" AS "r_ZipCode_28"
, r."PostOffice" AS "r_PostOffice_29"
, r."Road" AS "r_Road_30"
, r."HomeNumber" AS "r_HomeNumber_31"
, r."Phone1" AS "r_Phone1_32"
, r."Phone2" AS "r_Phone2_33"
, r."EMail" AS "r_EMail_34"
, r."DateOfBirth" AS "r_DateOfBirth_35"
, r."Sex" AS "r_Sex_36"
, r."Active" AS "r_Active_37"
, r."ProgramCode" AS "r_ProgramCode_38"
, r."CardNo" AS "r_CardNo_39"
, r."CreationDate" AS "r_CreationDate_40"
, r."Assigned" AS "r_Assigned_41"
, r."Reserved" AS "r_Reserved_42"
, r."Sent" AS "r_Sent_43"
, r."BothPrograms" AS "r_BothPrograms_44"
 FROM  "l_leftdatasetexample3214274191" AS l, "r_rightdatasetexample3214274191" AS r
 WHERE  
((l."Gender"=r."Sex") AND (l."Card"<>r."CardNo") AND (l."Name"=r."Name"))
 AND 
((l."ProgramCode"= '1') AND (r."ProgramCode"= '1'))
 AND 
(((l.l_composite_18 % r.r_composite_18)
)
 OR (((l."Name" % r."Name")
OR (l."LastName" % r."Surname")
OR (l."County" % r."District")
OR (l."Town" % r."Location")
OR (l."PostalCode" % r."ZipCode")
OR (l."PostOffice" % r."PostOffice")
OR (l."Street" % r."Road")
OR (l."Number" % r."HomeNumber")
OR (l."Telephone1" % r."Phone1")
OR (l."Telephone2" % r."Phone2")
OR (l."EMail" % r."EMail")
OR (l."BirthDate" % r."DateOfBirth")
OR (l."Gender" % r."Sex"))
)
) AND ((trigram_similarity(l.l_composite_18, r.r_composite_18)
 >= 0.7) 
 OR ((trigram_similarity(l."Name", r."Name")*(0.166666666666667) 
+ trigram_similarity(l."LastName", r."Surname")*(0.0833333333333333) 
+ trigram_similarity(l."County", r."District")*(0.0416666666666667) 
+ trigram_similarity(l."Town", r."Location")*(0.0416666666666667) 
+ trigram_similarity(l."PostalCode", r."ZipCode")*(0.0416666666666667) 
+ trigram_similarity(l."PostOffice", r."PostOffice")*(0.0416666666666667) 
+ trigram_similarity(l."Street", r."Road")*(0.0416666666666667) 
+ trigram_similarity(l."Number", r."HomeNumber")*(0.0416666666666667) 
+ trigram_similarity(l."Telephone1", r."Phone1")*(0.0416666666666667) 
+ trigram_similarity(l."Telephone2", r."Phone2")*(0.0416666666666667) 
+ trigram_similarity(l."EMail", r."EMail")*(0.0416666666666667) 
+ trigram_similarity(l."BirthDate", r."DateOfBirth")*(0.166666666666667) 
+ trigram_similarity(l."Gender", r."Sex")*(0.208333333333333) 
)
 >= 0.7)
 ) ORDER  BY trigram_similarity_composite_score DESC;

This query causes RAM clottage and application freezes. When "trigram_similarity" is replaced with "similarity", the query executes fast and without RAM overconsumption. Why "trigram_similarity" and "similarity" behave differently? Is there a way I could force GIN or GIST indexes utilization for this "trigram_similarity" function or any other function calling trigram's similarity function inside?

Explain analyze when "similarity" is used:

"Unique  (cost=170717.94..177633.17 rows=58853 width=383) (actual time=260362.193..260362.279 rows=99 loops=1)"
"  Output: ((similarity((l.l_composite_18)::text, (r.r_composite_18)::text))::numeric(5,4)), (((((((((((((((similarity((l."Name")::text, (r."Name")::text) * '0.166666666666667'::double precision) + (similarity((l."LastName")::text, (r."Surname")::text) * '0 (...)"
"  Buffers: shared hit=2513871 read=4158"
"  ->  Sort  (cost=170717.94..170865.07 rows=58853 width=383) (actual time=260362.192..260362.198 rows=99 loops=1)"
"        Output: ((similarity((l.l_composite_18)::text, (r.r_composite_18)::text))::numeric(5,4)), (((((((((((((((similarity((l."Name")::text, (r."Name")::text) * '0.166666666666667'::double precision) + (similarity((l."LastName")::text, (r."Surname")::text (...)"
"        Sort Key: ((similarity((l.l_composite_18)::text, (r.r_composite_18)::text))::numeric(5,4)) DESC, (((((((((((((((similarity((l."Name")::text, (r."Name")::text) * '0.166666666666667'::double precision) + (similarity((l."LastName")::text, (r."Surname" (...)"
"        Sort Method: quicksort  Memory: 76kB"
"        Buffers: shared hit=2513871 read=4158"
"        ->  Nested Loop  (cost=0.29..155793.36 rows=58853 width=383) (actual time=1851.503..260361.609 rows=99 loops=1)"
"              Output: (similarity((l.l_composite_18)::text, (r.r_composite_18)::text))::numeric(5,4), ((((((((((((((similarity((l."Name")::text, (r."Name")::text) * '0.166666666666667'::double precision) + (similarity((l."LastName")::text, (r."Surname")::t (...)"
"              Buffers: shared hit=2513871 read=4158"
"              ->  Seq Scan on public.r_rightdatasetexample3214274191 r  (cost=0.00..11228.86 rows=101669 width=188) (actual time=9.149..67.134 rows=50837 loops=1)"
"                    Output: r."ClanID", r."Name", r."Surname", r."District", r."Location", r."ZipCode", r."PostOffice", r."Road", r."HomeNumber", r."Phone1", r."Phone2", r."EMail", r."DateOfBirth", r."Sex", r."Active", r."ProgramCode", r."CardNo", r."Creat (...)"
"                    Filter: ((r."ProgramCode")::text = '1'::text)"
"                    Buffers: shared hit=5800 read=4158"
"              ->  Index Scan using "idxbNameA8D72F00099E4B70885B2E0BB1DFB684l_leftdatasetexample321" on public.l_leftdatasetexample3214274191 l  (cost=0.29..1.35 rows=1 width=195) (actual time=5.111..5.119 rows=0 loops=50837)"
"                    Output: l."ClanID", l."Name", l."LastName", l."County", l."Town", l."PostalCode", l."PostOffice", l."Street", l."Number", l."Telephone1", l."Telephone2", l."EMail", l."BirthDate", l."Gender", l."Aktivan", l."ProgramCode", l."Card", l."D (...)"
"                    Index Cond: ((l."Name")::text = (r."Name")::text)"
"                    Filter: (((l."ProgramCode")::text = '1'::text) AND ((l."Card")::text <> (r."CardNo")::text) AND ((r."Sex")::text = (l."Gender")::text) AND (((l.l_composite_18)::text % (r.r_composite_18)::text) OR ((l."Name")::text % (r."Name")::text) O (...)"
"                    Rows Removed by Filter: 50"
"                    Buffers: shared hit=2508071"
"Planning time: 13.885 ms"
"Execution time: 260362.730 ms"

Solution

  • It is possible to create GIN trgm_ops expression index on the compound (concatenated) expression. This index can faciliate % similarity operator, but not the similarity function.