Search code examples
sqlsnowflake-cloud-data-platformmatchingstring-matching

Fuzzy Matching in Different Tables with No Cross Join(Snowflake)


There are two tables A and B.

They both contain titles referencing the same thing, but the naming conventions are always different and cannot be predicted.

The only way to match titles is to find low difference scores on a number of columns, but for now only the title is important.

There are only about 10,000 records in each table currently. Using the standard Cross Join and EditDistance combination works fine now. But I've already noticed performance decreases as the number of records grow.

Is there a more performant way of achieving the desired result of finding partial matches between strings in different tables?

I apologize if there is an obvious answer. The few posts that deviate from the editdistance solution still assume cross joining : https://community.snowflake.com/s/question/0D50Z00008zPLLxSAO/join-with-partial-string-match


Solution

  • You should use a blocking key strategy to help cut down on the number of pairs generated. This document explains this strategy and other techniques for Fuzzy Matching on Snowflake. https://drive.google.com/file/d/1FuxZnXojx71t-1kNOaqg1ErrEiiATdsM/view?usp=sharing