Search code examples
openrefine

Openrefine: cross.cell for similar but not identical values


I have two dataset:

  • one dataset has names of countries, but dirty ones like

    Gaule Cisalpine (province romaine)

    Gaule belgique

    Gaule , Histoire

    Gaule

    ecc.

  • the second dataset has two columns with the names of countries (clean) and a code like

    Gaule | 1DDF

Is there a way to use cell.cross with value.contains() ? I tried to use reconcile-csv but it didn't work properly (it matches just the exact ones).


Solution

  • I've not been able to think of a great way of doing this, but given the substring you want to match between the two files is always the first thing in the 'messy' string, and if you want to do this in OpenRefine, I can see a way that might work by creating a 'match' column in each project for the cross matching.

    In the 'clean' project use 'Add column based on this column' on the 'Country name' column, and in the GREL transform use:

    value.fingerprint()
    

    The 'fingerprint' transformation is the same as the one used when doing clustering with key collision/fingerprint and basically I'm just using it here to get rid of any minor differences between country names (like upper/lower case or special characters)

    Then in the 'messy' project create a new column based on the dirty 'name of country' column again using the 'Add column based on this column' but in this case use the GREL transform something like:

    value.split(/[\s,-\.\(\)]/)[0].fingerprint()
    

    The first part of this "value.split(/[\s,-.()]/)" splits the string into individual words (using space, comma, fullstop, open or closed bracket as a separator). Then the '[0]' takes the first string (so the first word in the cell), then again uses the fingerprint algorithm.

    Now you have columns in each of the projects which should match on the exact cell content. You can use this to do the look up between the two projects.

    This isn't going to be completely ideal - for example if you have some country names which consist of multiple words it isn't going to work. However you could add some additional key columns to the 'messy' project which use the first 2,3,4 strings etc. rather than just the first one as given here.

    e.g.

    filter(value.split(/[\s,-\.\(\)]/),v,isNonBlank(v)).get(0,2).join(" ").fingerprint()
    
    filter(value.split(/[\s,-\.\(\)]/),v,isNonBlank(v)).get(0,3).join(" ").fingerprint()
    

    etc. (I've done a bit more work here to make sure blank entries are ignored - it's the get() command that's the key bit for getting the different numbers of words).

    I'm guessing that most country names are going to be only a few words long, so it would only be a few columns needed.

    I've not been able to come up with a better approach so far. I'll post some more here if I come up with anything else. You could also try asking on the OpenRefine forum https://groups.google.com/forum/#!forum/openrefine