
Is Approximate String Matching / Fuzzy String Searching possible with BigQuery?


Thanks to Google for delivering BigQuery, it's great!
Is Approximate String Matching / Fuzzy String Searching possible with BigQuery?
Does Google have plans to add this functionality to BigQuery?

Surely Google's proprietary approximate string matching algorithm could be used to deliver this capability in BigQuery while still protecting Google's intellectual property. We've searched all the BigQuery documentation and Stack Overflow questions. Of course there are many algorithms for this, but how can they be integrated with BigQuery?

Our need is simple: to compare two strings that will be mostly the same but could differ slightly. For example:

"Rhodes USA" vs. "Rhodes USA, LLC" vs. "Rhodes USA LLC".

From our BigQuery tests, it appears that two strings need to match EXACTLY for BigQuery to JOIN them, even down to the number of trailing spaces in each string. Either the addition of this functionality or guidance on integrating it with BigQuery would be greatly appreciated. This is in support of Milwaukee Jets, a regional, innovative, fractional jet ownership company in Milwaukee, WI. Thanks again Google for delivering BigQuery.

Thank you very much and best regards, AP


Solution

  • Unfortunately, approximate string matching is not supported. The closest you can get is by using regular expressions. Your best bet may be to normalize the data before it gets to BigQuery, i.e., transform "Rhodes USA" and "Rhodes, USA. " into the same string. I'll file a feature request for this support, however.
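
The normalization suggested above could be sketched roughly as follows, run as a preprocessing step before the data is loaded. This is only an illustration under stated assumptions: the function name `normalize_company_name` and the suffix list `COMPANY_SUFFIXES` are hypothetical, and the rules (lowercase, strip punctuation, collapse whitespace, drop trailing corporate suffixes) are one plausible choice rather than anything BigQuery prescribes.

```python
import re

# Hypothetical list of corporate suffixes to drop; extend it for your own data.
COMPANY_SUFFIXES = {"llc", "inc", "corp", "ltd", "co"}


def normalize_company_name(name: str) -> str:
    """Reduce a company name to a canonical key suitable for exact-match joins."""
    # Lowercase, then replace every run of characters that is not an ASCII
    # letter, digit, or space with a single space. Note that this also
    # discards accented letters, which may not suit every data set.
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", name.lower())
    # split() with no arguments trims the ends and collapses repeated spaces.
    tokens = cleaned.split()
    # Remove suffix tokens only from the end of the name.
    while tokens and tokens[-1] in COMPANY_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


if __name__ == "__main__":
    for raw in ["Rhodes USA", "Rhodes USA, LLC", "Rhodes USA LLC", "Rhodes, USA. "]:
        print(repr(raw), "->", repr(normalize_company_name(raw)))
```

Storing the normalized value in its own column alongside the original lets the JOIN run on the normalized key while the raw name stays available for display. Names that differ by more than punctuation, case, spacing, or a known suffix (for example, misspellings) would still not match with this approach.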