I am migrating a table from Postgres to Clickhouse, and one of the columns is a jsonb column which includes custom attributes. These attributes can be different per tenant, hence we currently have 100k different custom attributes' keys stored in postgres.
I checked Clickhouse's semi-structured JSON data options, and it seems we can use either Map(String, String) or 2 Array(String) columns holding the keys and values.
However I cannot make a proper assessment which one is best, as I get pretty similar results.
To test performance I created the following table:
CREATE TABLE maptest
(
`k` Int64,
`keys` Array(String),
`values` Array(String),
`map` Map(String, String)
)
ENGINE = MergeTree
ORDER BY k
SETTINGS index_granularity = 8192;
insert into maptest
select
number,
mapKeys(map(concat('custom', toString(number%87000)), toString(number%87000))),
mapValues(map(concat('custom', toString(number%87000)), toString(number%87000))),
map(concat('custom', toString(number%87000)), toString(number%87000))
from numbers(200000000);
--- data look like these:
SELECT *
FROM maptest
LIMIT 1
Query id: 9afcb888-94d9-42ec-a4b3-1d73b8cadde0
┌─k─┬─keys────────┬─values─┬─map─────────────┐
│ 0 │ ['custom0'] │ ['0'] │ {'custom0':'0'} │
└───┴─────────────┴────────┴─────────────────┘
However, whichever method I choose to query for a specific key-value pair, I always get the whole table scanned. e.g.
SELECT count()
FROM maptest
WHERE length(arrayFilter((v, k) -> ((k = 'custom2') AND (v = '2')), values, keys)) > 0
┌─count()─┐
│ 2299 │
└─────────┘
1 row in set. Elapsed: 10.541 sec. Processed 200.00 million rows, 9.95 GB (18.97 million rows/s., 943.85 MB/s.)
SELECT count()
FROM maptest
WHERE (map['custom2']) = '2'
┌─count()─┐
│ 2299 │
└─────────┘
1 row in set. Elapsed: 11.142 sec. Processed 200.00 million rows, 8.35 GB (17.95 million rows/s., 749.32 MB/s.)
SELECT count()
FROM maptest
WHERE (values[indexOf(keys, 'custom2')]) = '2'
┌─count()─┐
│ 2299 │
└─────────┘
1 row in set. Elapsed: 3.458 sec. Processed 200.00 million rows, 9.95 GB (57.83 million rows/s., 2.88 GB/s.)
Any suggestions on data skipping indexes for any of the 2 options?
You can add data skipping index for a map field, although you will need to set lower index_granularity
to get to the most optimal values between index size and how many granules will be skipped. You should build your index using mapValues
(or mapKeys
, depending on your needs) map function:
CREATE TABLE maptest
(
`k` Int64,
`keys` Array(String),
`values` Array(String),
`map` Map(String, String),
INDEX b mapValues(map) TYPE tokenbf_v1(2048, 16, 42) GRANULARITY 1
)
ENGINE = MergeTree
ORDER BY k
SETTINGS index_granularity = 2048; -- < lowered index_granularity!
insert into maptest
select
number,
mapKeys(map(concat('custom', toString(number%87000)), toString(number%87000))),
mapValues(map(concat('custom', toString(number%87000)), toString(number%87000))),
map(concat('custom', toString(number%87000)), toString(number%87000))
from numbers(20000000);
Now let's test it:
set send_logs_level='trace';
SELECT count()
FROM maptest
WHERE (map['custom2']) = '2';
(...)
[LAPTOP-ASLS2SOJ] 2023.02.01 11:44:52.171103 [ 96 ] {3638972e-baf3-4b48-bf10-7b944e46fc64} <Debug> default.maptest (11baab32-a7a8-4b0f-b879-ad1541cbe282) (SelectExecutor): Index `b` has dropped 9123/9767 granules.
(...)
┌─count()─┐
│ 230 │
└─────────┘
(...)
1 row in set. Elapsed: 0.107 sec. Processed 1.32 million rows, 54.52 MB (12.30 million rows/s., 508.62 MB/s.)