I need to remove all duplicates from a set of arrays, but 'duplicate' is defined in a special way here: two 4-element arrays are 'dupes' if they share the first two elements in any order and the last two elements in any order. My idea is to split each array into two halves, sort each 2-element half, and put the halves back together to form 4-element arrays. At that point the dupes become textbook duplicates that can be removed.
Is this a good approach?
We start with a set of 6 4-element arrays, none of which is an exact duplicate of another.
[6, 4, 3, 2]
[4, 6, 2, 3]
[3, 4, 2, 6]
[4, 3, 6, 2]
[3, 6, 2, 4]
[6, 3, 4, 2]
Split each array in the middle.
[[6, 4], [3, 2]]
[[4, 6], [2, 3]]
[[3, 4], [2, 6]]
[[4, 3], [6, 2]]
[[3, 6], [2, 4]]
[[6, 3], [4, 2]]
Here's the hard part in Neo4j! Sort each of the two inner arrays only.
[[4, 6], [2, 3]]
[[4, 6], [2, 3]]
[[3, 4], [2, 6]]
[[3, 4], [2, 6]]
[[3, 6], [2, 4]]
[[3, 6], [2, 4]]
Put them back together.
[4, 6, 2, 3]
[4, 6, 2, 3]
[3, 4, 2, 6]
[3, 4, 2, 6]
[3, 6, 2, 4]
[3, 6, 2, 4]
Dedupe by using DISTINCT.
[4, 6, 2, 3]
[3, 4, 2, 6]
[3, 6, 2, 4]
This very simple query (with your sample data) implements your approach, which seems reasonable:
WITH [
[6, 4, 3, 2],
[4, 6, 2, 3],
[3, 4, 2, 6],
[4, 3, 6, 2],
[3, 6, 2, 4],
[6, 3, 4, 2]
] AS data
UNWIND data AS d
// Swap each half only when it is out of order, then concatenate the two halves.
RETURN DISTINCT
  CASE WHEN d[0] > d[1] THEN [d[1], d[0]] ELSE d[0..2] END +
  CASE WHEN d[2] > d[3] THEN [d[3], d[2]] ELSE d[2..] END AS res;
The result is:
+-----------+
| res       |
+-----------+
| [4,6,2,3] |
| [3,4,2,6] |
| [3,6,2,4] |
+-----------+
The following query accepts a collection of sub-collections of any even size (not just 4). It sorts each consecutive pair of elements within every sub-collection and returns the distinct results.
For example (notice that the sub-collections do not have to be the same size):
WITH [
[6, 4, 3, 2, 3, 2],
[3, 4, 2, 6, 7, 8],
[4, 3, 6, 2, 8, 7],
[3, 6, 2, 4],
[6, 3, 4, 2],
[4, 6, 2, 3, 2, 3]
] AS data
// Walk each sub-collection two elements at a time, swapping any pair that is out of order.
WITH EXTRACT(d IN data |
  REDUCE(s = [], i IN RANGE(0, SIZE(d)-1, 2) |
    s + CASE WHEN d[i] > d[i+1] THEN [d[i+1], d[i]] ELSE d[i..i+2] END)) AS sorted
UNWIND sorted AS res
RETURN DISTINCT res;
The output of the above is:
+---------------+
| res           |
+---------------+
| [4,6,2,3,2,3] |
| [3,4,2,6,7,8] |
| [3,6,2,4] |
+---------------+
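Note that EXTRACT was deprecated in Neo4j 3.5 and, as far as I know, removed in 4.0, so the query above may not run on newer versions. A minimal sketch of the same pairwise normalization without EXTRACT, assuming the same sample data, is to UNWIND the outer collection first and apply REDUCE to each row:

```cypher
WITH [
  [6, 4, 3, 2, 3, 2],
  [3, 4, 2, 6, 7, 8],
  [4, 3, 6, 2, 8, 7],
  [3, 6, 2, 4],
  [6, 3, 4, 2],
  [4, 6, 2, 3, 2, 3]
] AS data
// One row per sub-collection, so no EXTRACT is needed.
UNWIND data AS d
// Step through d two elements at a time; swap a pair only when it is out of order.
RETURN DISTINCT
  REDUCE(s = [], i IN RANGE(0, SIZE(d) - 1, 2) |
    s + CASE WHEN d[i] > d[i+1] THEN [d[i+1], d[i]] ELSE d[i..i+2] END
  ) AS res;
```

This should produce the same three distinct rows as the EXTRACT version, though the order of the rows is not guaranteed.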