I have a workflow problem with OpenRefine (Google Refine 2.5 [r2407]): I need to do sophisticated duplicate-row cleaning, but all I have found so far is how to delete duplicate rows based on a single column.
My aim is to delete duplicate rows based on multiple columns, ideally in a specific hierarchy.
Given the following dummy data in Refine:
+----+---------+---------+--------+------------+------+-----------------------------------+
| id | timeAgo | title | author | date | val1 | [After Refine, keep Record] |
+----+---------+---------+--------+------------+------+-----------------------------------+
| 1 | 10 | Faust | Mr. A | 2014-01-15 | 10 | ->B, older entry |
| 2 | 11 | Faust | Mr. A | 2014-01-21 | 10 | A (because of Date) |
| 3 | 8 | Faust | Mr. A | 2014-01-15 | 10 | B |
| 4 | 8 | RedHead | Mr. B | 2014-01-21 | 34 | ->D, older entry |
| 5 | 7 | RedHead | Mr. B | 2014-01-21 | 34 | ->D, same time Ago, but lower ID |
| 6 | 7 | RedHead | Mr. A | 2014-01-01 | 13 | C (because of author, date, val1) |
| 7 | 7 | RedHead | Mr. B | 2014-01-21 | 34 | D |
+----+---------+---------+--------+------------+------+-----------------------------------+
I want to remove the duplicate rows based on the following logic. If
The result would be:
+---------+----+---------+---------+--------+------------+------+
| Refined | id | timeAgo | title | author | date | val1 |
+---------+----+---------+---------+--------+------------+------+
| A | 2 | 10 | Faust | Mr. A | 2014-01-21 | 10 |
| B | 3 | 8 | Faust | Mr. A | 2014-01-15 | 10 |
| C | 6 | 7 | RedHead | Mr. A | 2014-01-01 | 13 |
| D | 7 | 7 | RedHead | Mr. B | 2014-01-21 | 34 |
+---------+----+---------+---------+--------+------------+------+
If there is no other solution, I will gladly accept a scripting/GREL one.
But could the above logic be achieved with Refine's well-known workflow "recording", so that it could be extracted and applied to other datasets in the same format?
My motivation is to enable employees to work more thoughtfully with data (beyond Excel) without confronting them right away with a full-blown scripting language.
That sounds like a straightforward sorting problem.
value.split(',')[0]
to extract the first value (which should be the value for the record you want, if you sorted them in the right order).
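To make the sort-then-keep-first idea concrete outside of Refine, here is a minimal Python sketch run on the dummy data from the question. The rules in it are an assumption read off the annotation column of the input table, not something the question states in full: rows count as duplicates when title, author, date and val1 all match; within a group the lowest timeAgo wins, and on a tie the highest id wins. The function name `dedupe` is hypothetical. Keeping the first row of each sorted group plays the same role as taking element `[0]` of the split value.

```python
# Dummy data from the question: (id, timeAgo, title, author, date, val1)
rows = [
    (1, 10, "Faust",   "Mr. A", "2014-01-15", 10),
    (2, 11, "Faust",   "Mr. A", "2014-01-21", 10),
    (3,  8, "Faust",   "Mr. A", "2014-01-15", 10),
    (4,  8, "RedHead", "Mr. B", "2014-01-21", 34),
    (5,  7, "RedHead", "Mr. B", "2014-01-21", 34),
    (6,  7, "RedHead", "Mr. A", "2014-01-01", 13),
    (7,  7, "RedHead", "Mr. B", "2014-01-21", 34),
]


def dedupe(rows):
    """Keep one row per (title, author, date, val1) group.

    Assumed preference: lowest timeAgo first, then highest id.
    """
    # Sort so that the preferred row of every group comes first.
    ordered = sorted(rows, key=lambda r: (r[1], -r[0]))
    kept = {}
    for r in ordered:
        key = r[2:]              # title, author, date, val1
        kept.setdefault(key, r)  # only the first row seen per group is stored
    # Return the survivors in id order for readability.
    return sorted(kept.values(), key=lambda r: r[0])


kept_ids = [r[0] for r in dedupe(rows)]
print(kept_ids)
```

With the assumed rules, the surviving ids are the ones listed in the question's expected result table (2, 3, 6 and 7).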