Search code examples
openrefine

Google Refine: merge two data sets on similar-but-different column?


I've got two datasets, with common column name that are just slightly different.

Dataset A:

Name,Number
Hartlepool UA,306
Tyne and Wear Met County,683

Dataset B:

Name,Number
Hartlepool,366
Tyne and Wear,353

I thought Google Refine / Open Refine was designed to help reconcile this sort of messy data in a fuzzy way, but it seems not.

Is there any tool I can use to match these in a fuzzy way, without having to write code? I can write code easily enough if needed, but I'm surprised that Google Refine does not seem to do what I though it did.


Solution

  • You could also just:

    1. Strip all punctuation and consider the address cell as one big long string, if it's not already in one cell, you need to use some join()'s. Once it is a big string in a single cell, then use the Facet -> Custom Text Facet -> fingerprint() GREL function.

      fingerprint(value)
      value.fingerprint()
      
    2. Using that big long string, you can also perform your fuzzy analysis with Clustering or with a Custom Text Facet and

      value.ngramFingerprint(2)
      

    Those are probably still not the best solutions for your type of messy data.... Addresses. especially, if your considering your 2 examples to be "Same As", which they are not. And so you will need to take the approach that Tom suggests.

    But if your really trying to say...hey, most of the words and numbers in these 2 address are the same, and I think they are pretty close in my context, and want to measure in some way the closeness of the 2 strings fingerprints... then the above 2 steps should get you pointed in the right direction.