Search code examples
clickhouse

Why is ClickHouse dictionary performance so low?


I have a table with products names in PostgreSql database. Total rows is ~30M. And I have history of prices in ClickHouse. I want to join names to prices. DDL to create dictionary:

CREATE DICTIONARY products_dict
(
    product_id String,
    name String
)
PRIMARY KEY product_id
SOURCE(POSTGRESQL(
    ...
    query 'SELECT product_id, name FROM products'
))
LAYOUT(COMPLEX_KEY_HASHED())
LIFETIME(3600);

Then I have dictionary:

database:                    wdm
name:                        products_dict
uuid:                        1464ba09-990c-4e69-9464-ba09990c0e69
status:                      LOADED
origin:                      1464ba09-990c-4e69-9464-ba09990c0e69
type:                        ComplexKeyHashed
key.names:                   ['product_id']
key.types:                   ['String']
attribute.names:             ['name']
attribute.types:             ['String']
bytes_allocated:             4831830312
query_count:                 57912282
hit_rate:                    1
found_rate:                  1
element_count:               28956140
load_factor:                 0.4314801096916199
source:                      PostgreSQL: ...
lifetime_min:                0
lifetime_max:                3600
loading_start_time:          2022-01-17 03:53:21
last_successful_update_time: 2022-01-17 03:54:46
loading_duration:            84.79
last_exception:              
comment:                     

Also I've got table for this dictionary:

-- auto-generated definition
create table products_dict
(
    product_id String,
    name       String
)
    engine = Dictionary;

When I query this dictionary, it tooks ~3 sec.

  1. One id with WHERE IN
SELECT name FROM products_dict WHERE  product_id IN ('97646221')
1 row retrieved starting from 1 in 2 s 891 ms (execution: 2 s 841 ms, fetching: 50 ms)
  1. 501 products without conditions and sorting
SELECT t.*
     FROM products_dict t
     LIMIT 501
500 rows retrieved starting from 1 in 2 s 616 ms (execution: 2 s 601 ms, fetching: 15 ms)
  1. JOIN
SELECT ppd.*, p.name
              FROM
                  (
                      SELECT
                          product_id,
                          price
                      FROM product_prices_daily
                      WHERE
                            product_id IN ('97646221','97646318','976464823','97647223','976472425','976474961','976476908')
                            AND day between '2022-01-13' and '2022-01-14'
                  ) as ppd
          LEFT JOIN products_dict as p ON p.product_id = ppd.product_id
4 rows retrieved starting from 1 in 6 s 984 ms (execution: 6 s 959 ms, fetching: 25 ms)

DBMS: ClickHouse (ver. 21.12.3.32) Client: DataGrip 2021.3.2 Server: 128 GB RAM, dozens of cores, 3TB SSD without any load. Reading from 16 billion MergeTree table by product_id tooks ~100ms. I've tested manually created table with engine=dictionary and got the same results. I cannot use flat layout because product_id is string.

Another test with clickhouse-client:

ch01 :) SELECT name FROM products_dict WHERE  product_id IN ('97646239');

SELECT name
FROM products_dict
WHERE product_id IN ('97646239')

Query id: d4f467c9-be0e-4619-841b-a76251d3e714

┌─name──┐
│ ...│
└───────┘

1 rows in set. Elapsed: 2.859 sec. Processed 28.96 million rows, 2.30 GB (10.13 million rows/s., 803.25 MB/s.)

What's wrong?


Solution

  • Such optimization is not implemented, yet.

    Initially supposed that dictionaries to be used with only dictGet functions.

    Table representation were introduced much later.

    Internally Dictionaries are the set of hash tables -- if your dictionary has 50 attributes then it will be 50 hash tables. These hash tables are very fast if you do seek by key, but very slow if you need to find the next element.

    Right now the query SELECT name FROM products_dict WHERE product_id IN ('97646239') is executed in very straightforward way, though it could be converted into dictGet under the hood.