Tags: performance, sample, clickhouse

Using SAMPLE in ClickHouse seems to read all rows and more bytes. Is this expected or due to sub-optimal table definition?


I'm hoping to trade accuracy for speed by using SAMPLE. While some queries do run faster, the query_log shows that all rows are still being read, and more bytes are read than when not using SAMPLE. I expected SAMPLE to result in fewer rows being read.

My questions:

  1. Should using SAMPLE result in more rows and bytes being read?

  2. Is this increased reading due to a sub-optimal table definition that can be corrected?

I'm using ClickHouse version 20.5.3 revision 54435.

Table definition:

CREATE TABLE default.table_one
(
  `datestamp` Date,
  `timestamp` UInt64,
  `sample_hash` UInt64,
  `value` UInt32,
  ...
)
ENGINE = MergeTree()
PARTITION BY datestamp
ORDER BY (datestamp, timestamp, sample_hash)
SAMPLE BY sample_hash
SETTINGS index_granularity = 8192

Query without SAMPLE

SELECT
  avg(value)
FROM default.table_one;

query_duration_ms: 166
rows_read: 100,000,000
read_bytes: 800,000,000

Query with SAMPLE

SELECT
  avg(value)
FROM default.table_one
SAMPLE 0.1;

query_duration_ms: 358
rows_read: 100,000,000
read_bytes: 1,600,000,000
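For reference, stats like the ones above can be pulled from system.query_log; this is a hedged sketch (the actual log columns are named read_rows and read_bytes, slightly different from the labels used above):

```sql
-- Inspect per-query stats for recently finished queries
SELECT query_duration_ms, read_rows, read_bytes, query
FROM system.query_log
WHERE type = 'QueryFinish'
ORDER BY event_time DESC
LIMIT 5;
```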

Solution

  • Short answer: yes, this is expected, because ClickHouse needs to read one extra column, sample_hash, in order to apply the sample filter.

    Long answer: Sampling is hard. It is useful if you have 100 billion rows per day and 400 servers. It helps a lot with GROUP BY queries. It's not helpful for filtering because, in your case, it does not work together with the primary index. Yandex designed sampling for themselves: they enable forced partitioning-key/primary-key usage (force_index_by_date / force_primary_key), so queries like yours are impossible in their system, and thus sampling helps them even at the level of disk reads.
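To see why the extra column must be read, note that SAMPLE k is conceptually rewritten into a range filter on the sampling expression. A rough, hedged illustration (the exact internal rewrite differs, and the constant below is simply the UInt64 maximum):

```sql
-- SAMPLE 0.1 behaves approximately like this explicit filter,
-- which forces ClickHouse to read the sample_hash column:
SELECT avg(value)
FROM default.table_one
WHERE sample_hash < toUInt64(0.1 * 18446744073709551615);
```

Since sample_hash is not the leading primary-key column here, this filter cannot skip granules, so all rows are scanned plus one additional column's worth of bytes.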

    That's why I don't use Sampling in my systems.

    But

    ORDER BY (datestamp, timestamp, sample_hash)

    Also, such an ORDER BY is not useful at all; this whole table is misdesigned. There is no sense in putting datestamp at the front of the index, because the table is partitioned by datestamp, so each partition contains only ONE datestamp value.

    Putting timestamp next in the index prefix is an even bigger issue, because it is very unwise to place a high-cardinality column at the beginning of a primary index.
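Applying those two points to the original schema, one possible restructuring might look like this. This is only a hedged sketch, assuming queries filter by time ranges; the right ORDER BY always depends on the actual query patterns:

```sql
CREATE TABLE default.table_one_v2
(
  `datestamp` Date,
  `timestamp` UInt64,
  `sample_hash` UInt64,
  `value` UInt32
)
ENGINE = MergeTree()
PARTITION BY datestamp
-- drop datestamp from the key (each partition holds one value anyway),
-- coarsen timestamp so the leading key column is low-cardinality,
-- and keep the sampling column last in the key
ORDER BY (toStartOfHour(toDateTime(timestamp)), sample_hash)
SAMPLE BY sample_hash
SETTINGS index_granularity = 8192
```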

    So, I can create a synthetic example and show how sampling works. But does it make sense for your case?

    CREATE TABLE table_one
    ( timestamp UInt64,
      transaction_id UInt64,
      banner_id UInt16,
      value UInt32
    )
    ENGINE = MergeTree()
    PARTITION BY toYYYYMMDD(toDateTime(timestamp))
    ORDER BY (banner_id, toStartOfHour(toDateTime(timestamp)),  cityHash64(transaction_id))
    SAMPLE BY cityHash64(transaction_id)
    SETTINGS index_granularity = 8192
    
    
    insert into table_one select 1602809234+intDiv(number,100000), number, number%991, toUInt32(rand())
    from numbers(10000000000);
    
    
    
    select banner_id, sum(value), count(value), max(value)
    from table_one 
    group by banner_id format Null;
    
    0 rows in set. Elapsed: 11.490 sec. Processed 10.00 billion rows, 60.00 GB (870.30 million rows/s., 5.22 GB/s.)
    
    select banner_id, sum(value), count(value), max(value)
    from table_one SAMPLE 0.01
    group by banner_id format Null;
    
    0 rows in set. Elapsed: 1.316 sec. Processed 452.67 million rows, 6.34 GB (343.85 million rows/s., 4.81 GB/s.)
    
    
    
    select banner_id, sum(value), count(value), max(value)
    from table_one 
    WHERE banner_id = 42
    group by banner_id format Null;
    
    0 rows in set. Elapsed: 0.020 sec. Processed 10.30 million rows, 61.78 MB (514.37 million rows/s., 3.09 GB/s.)
    
    select banner_id, sum(value), count(value), max(value)
    from table_one SAMPLE 0.01
    WHERE banner_id = 42
    group by banner_id format Null;
    
    0 rows in set. Elapsed: 0.008 sec. Processed 696.32 thousand rows, 9.75 MB (92.49 million rows/s., 1.29 GB/s.)
    
    
    
    
    select banner_id, toStartOfHour(toDateTime(timestamp)) hr, sum(value), count(value), max(value)
    from table_one 
    group by banner_id, hr format Null;
    0 rows in set. Elapsed: 36.660 sec. Processed 10.00 billion rows, 140.00 GB (272.77 million rows/s., 3.82 GB/s.)
    
    select banner_id, toStartOfHour(toDateTime(timestamp)) hr, sum(value), count(value), max(value)
    from table_one SAMPLE 0.01
    group by banner_id, hr format Null;
    0 rows in set. Elapsed: 3.741 sec. Processed 452.67 million rows, 9.96 GB (121.00 million rows/s., 2.66 GB/s.)
    
    
    
    
    select count()
    from table_one 
    where value = 666 format Null;
    1 rows in set. Elapsed: 6.056 sec. Processed 10.00 billion rows, 40.00 GB (1.65 billion rows/s., 6.61 GB/s.)
    
    select count()
    from table_one  SAMPLE 0.01
    where value = 666 format Null;
    1 rows in set. Elapsed: 1.214 sec. Processed 452.67 million rows, 5.43 GB (372.88 million rows/s., 4.47 GB/s.)
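One final note on the sampled queries above: SAMPLE returns raw, unscaled aggregates over roughly 1% of the rows. To extrapolate to the full table, ClickHouse exposes a virtual _sample_factor column that can be multiplied in:

```sql
-- Scale the sampled count back up to an estimate for the whole table
SELECT count() * any(_sample_factor)
FROM table_one SAMPLE 0.01
WHERE value = 666;
```

The result is an estimate, not an exact count, which is exactly the accuracy-for-speed trade the question asks about.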