OpenRefine remove duplicates from list with jython


I have a column with values that are duplicated e.g.

VMS5796,VMS5650,VMS5650,CSL,VMA5216,CSL,VMA5113

I'm applying a Jython transform that removes the duplicates (On error is set to keep original). Here's the code:

return list(set(value.split(",")))

Which works in the preview, but isn't getting applied to the column. What am I doing wrong?


Solution

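  • The preview can display a list, but as far as I know an OpenRefine cell cannot store one: the expression has to return a single value such as a string. Your code returns a list, so the result is most likely treated as an error, and with On error set to keep original the cells are left unchanged. The fix is to join the deduplicated list back into a string with a separator such as a comma ', '. The map function is powerful and underused in Python / Jython: here it converts ('maps') every item in the list to a string type, quickly even for large lists, so that join can accept them.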

    deduped_list = list(set(value.split(",")))
    return ', '.join(map(str, deduped_list))
    

    There are probably other, even slightly faster variations than this, but this should get you going in the right direction.

    Interestingly, you can also join on the 'printable representation' of each item by mapping repr instead of str. repr(object) returns a string that is acceptable to an eval like OpenRefine's, and it can be useful for seeing how your values are actually represented.

    deduped_list = list(set(value.split(",")))
    return ', '.join(map(repr, deduped_list))
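
One caveat with the snippets above: set() does not keep the original order of the items. If order matters, a minimal sketch of an order-preserving variant could look like the following. The function name dedupe_cell and its sep parameter are made up for illustration and are not part of OpenRefine; in the OpenRefine expression editor you would use the body with value directly and return the joined string.

```python
def dedupe_cell(value, sep=","):
    """Remove duplicate items from a delimited string, keeping first-seen order."""
    seen = set()
    unique = []
    for item in value.split(sep):
        item = item.strip()  # ignore stray spaces around each item
        if item and item not in seen:
            seen.add(item)
            unique.append(item)
    # Join back into one string so the result can be stored in a cell
    return sep.join(unique)


print(dedupe_cell("VMS5796,VMS5650,VMS5650,CSL,VMA5216,CSL,VMA5113"))
# prints VMS5796,VMS5650,CSL,VMA5216,VMA5113
```

The set is only used for the membership check, while the list keeps the order in which values were first seen.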