Search code examples
openrefine

Keep newest duplicate row depending on multiple Columns


I seem to have a workflow problem with Open Refine (Google Refine 2.5 [r2407]) to do sophisticated duplicate row cleaning. All I have found so far is how to delete duplicate rows based on a single column.

My aim is to delete duplicate rows based on multiple columns, at best, in a specific hierarchy.

Example

Given the following dummy data in Refine

+----+---------+---------+--------+------------+------+-----------------------------------+
| id | timeAgo |  title  | author |    date    | val1 |    [After Refine, keep Record]    |
+----+---------+---------+--------+------------+------+-----------------------------------+
|  1 |      10 | Faust   | Mr. A  | 2014-01-15 |   10 | ->B, older entry                  |
|  2 |      11 | Faust   | Mr. A  | 2014-01-21 |   10 | A (because of Date)               |
|  3 |       8 | Faust   | Mr. A  | 2014-01-15 |   10 | B                                 |
|  4 |       8 | RedHead | Mr. B  | 2014-01-21 |   34 | ->D, older entry                  |
|  5 |       7 | RedHead | Mr. B  | 2014-01-21 |   34 | ->D, same time Ago, but lower ID  |
|  6 |       7 | RedHead | Mr. A  | 2014-01-01 |   13 | C (because of author, date, val1) |
|  7 |       7 | RedHead | Mr. B  | 2014-01-21 |   34 | D                                 |
+----+---------+---------+--------+------------+------+-----------------------------------+

I want to kill the duplicate rows based on following logic. If

  • title && auther && date && val1 are the same, than
  • keep the newest (least timeAgo) row, if there are multiple, than
  • keep the one with the highest id

The Result would be:

+---------+----+---------+---------+--------+------------+------+
| Refined | id | timeAgo |  title  | author |    date    | val1 |
+---------+----+---------+---------+--------+------------+------+
| A       |  2 |      10 | Faust   | Mr. A  | 2014-01-21 |   10 |
| B       |  3 |       8 | Faust   | Mr. A  | 2014-01-15 |   10 |
| C       |  6 |       7 | RedHead | Mr. A  | 2014-01-01 |   13 |
| D       |  7 |       7 | RedHead | Mr. B  | 2014-01-21 |   34 |
+---------+----+---------+---------+--------+------------+------+

Easy Approach?

If there is no other solution, I thankfully take a scripting/GREL one.

But could it be done by Refines famous workflow "recording" to achieve above logic, so it could be extracted and applied to other same format datasets?

My motivation behind this is to enable employees to work more thoughtfully with data (beyond excel) but without confronting them right away with a full blown scripting language.


Solution

  • That sounds like a straightforward sorting problem.

    1. Sort the records by title, author, time ago, and ID
    2. Re-order rows permanently (IMPORTANT - it won't work if you forget this step)
    3. Blank down on Title & Author
    4. Move those two columns to the two left most positions
    5. Join multivalued cells on remaining columns
    6. Transform all columns from step 5 using value.split(',')[0] to extract the first value (which should be the value for the record you want if you sorted them in the right order