Search code examples
clickhouse

Clickhouse Materialized View on ReplicatedAggregatingMergeTree


I confused with Materialized View, Suppose I have following Table:

CREATE TABLE IF NOT EXISTS analytics.Records
(
  -- User Checksum
  CidUserChecksum                             FixedString(24),
  
  -- Session Checksum
  CidSessionChecksum                          FixedString(24),

  -- requirements
  IP                                          IPv4,
  PublicInstanceID                            String,
  Created                                     Datetime
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/analytics/Records', '{replica}')
ORDER BY (Created, Mode, xxHash32(PublicInstanceID))
PARTITION BY toYYYYMM(Created)
SAMPLE BY (xxHash32(PublicInstanceID));

I want to fill following aggregate table with MV:

CREATE TABLE IF NOT EXISTS analytics.VisitsHourly
(
  PublicInstanceID   String,
  Hour               DateTime,
  UsersCount         UInt64,
  SessionsCount      UInt64,
  PageviewsCount     UInt64
)
ENGINE = ReplicatedAggregatingMergeTree('/clickhouse/tables/{shard}/analytics/VisitsHourly', '{replica}')
PARTITION BY toYYYYMM(Hour)
ORDER BY (Hour, PublicInstanceID);

The query to retrieve this fields is:

SELECT 
    PublicInstanceID,
    toStartOfHour(Created) as Hour,
    countDistinct(CidUserChecksum) as UsersCount,
    countDistinct(CidSessionChecksum) as SessionsCount,
    count() as PageviewsCount
FROM analytics.Records
GROUP BY PublicInstanceID, toStartOfHour(Created);

following query for Creating MV is not Correct, and I cannot understand why!

CREATE MATERIALIZED VIEW IF NOT EXISTS analytics.VisitsHourlyMV to analytics.VisitsHourly
AS SELECT 
    PublicInstanceID,
    toStartOfHour(Created) as Hour,
    countDistinct(CidUserChecksum) as UsersCount,
    countDistinct(CidSessionChecksum) as SessionsCount,
    count() as PageviewsCount
FROM analytics.Records
GROUP BY PublicInstanceID, toStartOfHour(Created);

What is the correct version of VisitsHourlyMV? what is the problem?


Solution

  • It's complicated.

    UniqExact (countDistinct) vs HyperLogLog

    countDistinct is very heavy function which calculates exact number and it needs raw data. There is no sense to use MatView for countDistinct (uniqExact) because the size of the matView will be the same as a source table.

    Clickhouse has several functions to calculate approximate countDistinct (uniq) using HyperLogLog algo

    uniqHLL12, uniq, uniqCombined, uniqCombined64

    Here is some comparison of the stored size and QPS https://kb.altinity.com/altinity-kb-schema-design/uniq-functions/

    Approximation error is slightly different but insignificant in the most cases.

    I suggest to use uniqHLL12. It will save a ton of disk space and ultra-fast.

    select countDistinct(number) from numbers(1e9);
    ┌─uniqExact(number)─┐
    │        1000000000 │
    └───────────────────┘
    1 row in set. Elapsed: 66.012 sec. Processed 1.00 billion rows, 8.00 GB (15.15 million rows/s., 121.19 MB/s.)
    
    select uniqHLL12(number) from numbers(1e9);
    ┌─uniqHLL12(number)─┐
    │        1000307678 │
    └───────────────────┘
    1 row in set. Elapsed: 9.671 sec. Processed 1.00 billion rows, 8.00 GB (103.40 million rows/s., 827.23 MB/s.)
    

    As you can see uniqHLL12 6 times faster (stores 1.000.0000 times less data, no joke), and has a small error -~ 0.03% (in this case, mileage can vary). With MatViews the difference is move obvious.


    AggregateFunction

    https://clickhouse.com/docs/en/sql-reference/data-types/aggregatefunction

    To store uniqs you have to AggregateFunction and states https://altinity.com/blog/2017/7/10/clickhouse-aggregatefunctions-and-aggregatestate

    https://medium.com/@f1yegor/clickhouse-aggregatefunctions-and-aggregatestate-e3fd46b7be74

    https://clickhouse.com/blog/using-materialized-views-in-clickhouse

    https://engineering.oden.io/blog/how-to-write-a-clickhouse-aggregate-function


    Example

    CREATE TABLE IF NOT EXISTS analytics.Records
    (
      CidUserChecksum                             FixedString(24),
      CidSessionChecksum                          FixedString(24),
      IP                                          IPv4,
      PublicInstanceID                            String,
      Created                                     Datetime
    )
    ENGINE = MergeTree
    ORDER BY (Created,  xxHash32(PublicInstanceID))
    PARTITION BY toYYYYMM(Created)
    SAMPLE BY (xxHash32(PublicInstanceID));
    
    CREATE TABLE IF NOT EXISTS analytics.VisitsHourly
    (
      PublicInstanceID   String,
      Hour               DateTime,
      UsersCount         AggregateFunction(uniqHLL12, FixedString(24)),
      SessionsCount      AggregateFunction(uniqHLL12, FixedString(24)),
      PageviewsCount     SimpleAggregateFunction(sum,UInt64)
    )
    ENGINE =  AggregatingMergeTree
    PARTITION BY toYYYYMM(Hour)
    ORDER BY (Hour, PublicInstanceID);
    
    CREATE MATERIALIZED VIEW IF NOT EXISTS analytics.VisitsHourlyMV to analytics.VisitsHourly
    AS SELECT 
        PublicInstanceID,
        toStartOfHour(Created) as Hour,
        uniqHLL12State(CidUserChecksum) as UsersCount,
        uniqHLL12State(CidSessionChecksum) as SessionsCount,
        count() as PageviewsCount
    FROM analytics.Records
    GROUP BY PublicInstanceID, toStartOfHour(Created);
    
    insert into analytics.Records
    select toFixedString(toString(cityHash64(number)),24), 
           toFixedString(toString(cityHash64(number+1)),24),
           '0.0.0.0', number%11111, today() from numbers(10000000);
    
    SELECT
        countDistinct(CidUserChecksum) AS UsersCount,
        countDistinct(CidSessionChecksum) AS SessionsCount,
        count() AS PageviewsCount
    FROM analytics.Records
    ┌─UsersCount─┬─SessionsCount─┬─PageviewsCount─┐
    │   10000000 │      10000000 │       10000000 │
    └────────────┴───────────────┴────────────────┘
    
    1 row in set. Elapsed: 1.593 sec. Processed 10.00 million rows, 480.01 MB (6.28 million rows/s., 301.26 MB/s.)
    
    
    SELECT
        uniqHLL12Merge(UsersCount) AS UsersCount,
        uniqHLL12Merge(SessionsCount) AS SessionsCount,
        sum(PageviewsCount) AS PageviewsCount
    FROM analytics.VisitsHourly
    ┌─UsersCount─┬─SessionsCount─┬─PageviewsCount─┐
    │   10380374 │      10380374 │       10000000 │
    └────────────┴───────────────┴────────────────┘
    1 row in set. Elapsed: 0.197 sec. Processed 11.11 thousand rows, 4.45 MB (56.50 thousand rows/s., 22.64 MB/s.)
    

    You need to understand the role of -State and -Merge combinators. AggregateFunction is the one of core conception of Clickhouse and very important conception. AggregateFunction is the nature of Clickhouse, distributed queries, multithread calculations are based on AggregateFunction.