Search code examples
clickhouse

Join two datasets with key duplicates one by one


I need to join two datasets from e.g. left and right source to match values by some keys. Datasets can contain duplicates:

┌─key─┬─value──┬─source──┐
│ 1   │ val1   │ left    │
│ 1   │ val1   │ left    │   << duplicate from left source
│ 1   │ val1   │ left    │   << another duplicate from left source
│ 1   │ val1   │ right   │
│ 1   │ val1   │ right   │   << duplicate from right source
│ 2   │ val2   │ left    │
│ 2   │ val3   │ right   │
└─────┴────────┴─-----───┘

I cant use full join, it gives cartesian products of all duplicates.
I am trying to use group by instead:

select
  `key`,
  anyIf(value, source = 'left') as left_value,
  anyIf(value, source = 'right') as right_value
from test_raw
group by key;

It works good, but is there any way to match left and right duplicates?
Expected result:

┌─key─┬─left_value─┬─right_value─┐
│ 1   │ val1       │ val1        │
│ 1   │ val1       │ val1        │   
│ 1   │ val1       │             │   
│ 2   │ val2       │ val3        │
└─────┴────────────┴─────────────┘

Scripts to reproduce:

create table test_raw
(`key` Int64,`value` String,`source` String)
ENGINE = Memory;

insert into test_raw (`key`,`value`,`source`)
values
(1, 'val1', 'left'),
(1, 'val1', 'left'),
(1, 'val1', 'left'),
(1, 'val1', 'right'),
(1, 'val1', 'right'),
(2, 'val2', 'left'),
(2, 'val3', 'right');

select
  `key`,
  anyIf(value, source = 'left') as left_value,
  anyIf(value, source = 'right') as right_value
from test_raw
group by key;

Solution

  • SELECT
        key,
        left_value,
        right_value
    FROM
    (
        SELECT
            key,
            arraySort(groupArrayIf(value, source = 'left')) AS l,
            arraySort(groupArrayIf(value, source = 'right')) AS r,
            arrayMap(i -> (l[i + 1], r[i + 1]), range(greatest(length(l), length(r)))) AS t
        FROM test_raw
        GROUP BY key
    )
    ARRAY JOIN
        t.1 AS left_value,
        t.2 AS right_value
    ORDER BY key ASC
    
    ┌─key─┬─left_value─┬─right_value─┐
    │   1 │ val1       │ val1        │
    │   1 │ val1       │ val1        │
    │   1 │ val1       │             │
    │   1 │ val1       │             │
    │   2 │ val2       │ val3        │
    └─────┴────────────┴─────────────┘