Search code examples
clickhouse

Data skipping index for Map or pair-wise arrays in Clickhouse?


I am migrating a table from Postgres to Clickhouse, and one of the columns is a jsonb column which includes custom attributes. These attributes can be different per tenant, hence we currently have 100k different custom attributes' keys stored in postgres.

I checked Clickhouse's semi-structured JSON data options, and it seems we can use either Map(String, String) or 2 Array(String) columns holding the keys and values.

However I cannot make a proper assessment which one is best, as I get pretty similar results.

To test performance I created the following table:

CREATE TABLE maptest
 (
     `k` Int64,
     `keys` Array(String),
     `values` Array(String),
     `map` Map(String, String)
  )
 ENGINE = MergeTree
 ORDER BY k
 SETTINGS index_granularity = 8192;

insert into maptest 
select 
    number, 
    mapKeys(map(concat('custom', toString(number%87000)), toString(number%87000))), 
    mapValues(map(concat('custom', toString(number%87000)), toString(number%87000))), 
    map(concat('custom', toString(number%87000)), toString(number%87000)) 
from numbers(200000000);

--- data look like these:

SELECT *
FROM maptest
LIMIT 1

Query id: 9afcb888-94d9-42ec-a4b3-1d73b8cadde0

┌─k─┬─keys────────┬─values─┬─map─────────────┐
│ 0 │ ['custom0'] │ ['0']  │ {'custom0':'0'} │
└───┴─────────────┴────────┴─────────────────┘

However, whichever method I choose to query for a specific key-value pair, I always get the whole table scanned. e.g.

SELECT count()
FROM maptest
WHERE length(arrayFilter((v, k) -> ((k = 'custom2') AND (v = '2')), values, keys)) > 0

┌─count()─┐
│    2299 │
└─────────┘

1 row in set. Elapsed: 10.541 sec. Processed 200.00 million rows, 9.95 GB (18.97 million rows/s., 943.85 MB/s.)

SELECT count()
FROM maptest
WHERE (map['custom2']) = '2'


┌─count()─┐
│    2299 │
└─────────┘

1 row in set. Elapsed: 11.142 sec. Processed 200.00 million rows, 8.35 GB (17.95 million rows/s., 749.32 MB/s.)



SELECT count()
FROM maptest
WHERE (values[indexOf(keys, 'custom2')]) = '2'


┌─count()─┐
│    2299 │
└─────────┘

1 row in set. Elapsed: 3.458 sec. Processed 200.00 million rows, 9.95 GB (57.83 million rows/s., 2.88 GB/s.)

Any suggestions on data skipping indexes for any of the 2 options?


Solution

  • You can add data skipping index for a map field, although you will need to set lower index_granularity to get to the most optimal values between index size and how many granules will be skipped. You should build your index using mapValues (or mapKeys, depending on your needs) map function:

    CREATE TABLE maptest
     (
         `k` Int64,
         `keys` Array(String),
         `values` Array(String),
         `map` Map(String, String),
          INDEX b mapValues(map) TYPE tokenbf_v1(2048, 16, 42) GRANULARITY 1
      )
     ENGINE = MergeTree
     ORDER BY k
     SETTINGS index_granularity = 2048; -- < lowered index_granularity!
    
    insert into maptest 
    select 
        number, 
        mapKeys(map(concat('custom', toString(number%87000)), toString(number%87000))), 
        mapValues(map(concat('custom', toString(number%87000)), toString(number%87000))), 
        map(concat('custom', toString(number%87000)), toString(number%87000)) 
    from numbers(20000000);
    

    Now let's test it:

    set send_logs_level='trace';
    
    SELECT count()
    FROM maptest
    WHERE (map['custom2']) = '2';
    
    (...)
    [LAPTOP-ASLS2SOJ] 2023.02.01 11:44:52.171103 [ 96 ] {3638972e-baf3-4b48-bf10-7b944e46fc64} <Debug> default.maptest (11baab32-a7a8-4b0f-b879-ad1541cbe282) (SelectExecutor): Index `b` has dropped 9123/9767 granules.
    (...)
    ┌─count()─┐
    │     230 │
    └─────────┘
    (...)
    1 row in set. Elapsed: 0.107 sec. Processed 1.32 million rows, 54.52 MB (12.30 million rows/s., 508.62 MB/s.)