I confused with Materialized View, Suppose I have following Table:
CREATE TABLE IF NOT EXISTS analytics.Records
(
-- User Checksum
CidUserChecksum FixedString(24),
-- Session Checksum
CidSessionChecksum FixedString(24),
-- requirements
IP IPv4,
PublicInstanceID String,
Created Datetime
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/analytics/Records', '{replica}')
ORDER BY (Created, Mode, xxHash32(PublicInstanceID))
PARTITION BY toYYYYMM(Created)
SAMPLE BY (xxHash32(PublicInstanceID));
I want to fill following aggregate table with MV:
CREATE TABLE IF NOT EXISTS analytics.VisitsHourly
(
PublicInstanceID String,
Hour DateTime,
UsersCount UInt64,
SessionsCount UInt64,
PageviewsCount UInt64
)
ENGINE = ReplicatedAggregatingMergeTree('/clickhouse/tables/{shard}/analytics/VisitsHourly', '{replica}')
PARTITION BY toYYYYMM(Hour)
ORDER BY (Hour, PublicInstanceID);
The query to retrieve this fields is:
SELECT
PublicInstanceID,
toStartOfHour(Created) as Hour,
countDistinct(CidUserChecksum) as UsersCount,
countDistinct(CidSessionChecksum) as SessionsCount,
count() as PageviewsCount
FROM analytics.Records
GROUP BY PublicInstanceID, toStartOfHour(Created);
following query for Creating MV is not Correct, and I cannot understand why!
CREATE MATERIALIZED VIEW IF NOT EXISTS analytics.VisitsHourlyMV to analytics.VisitsHourly
AS SELECT
PublicInstanceID,
toStartOfHour(Created) as Hour,
countDistinct(CidUserChecksum) as UsersCount,
countDistinct(CidSessionChecksum) as SessionsCount,
count() as PageviewsCount
FROM analytics.Records
GROUP BY PublicInstanceID, toStartOfHour(Created);
What is the correct version of VisitsHourlyMV? what is the problem?
It's complicated.
countDistinct is very heavy function which calculates exact number and it needs raw data. There is no sense to use MatView for countDistinct (uniqExact) because the size of the matView will be the same as a source table.
Clickhouse has several functions to calculate approximate countDistinct (uniq) using HyperLogLog algo
uniqHLL12, uniq, uniqCombined, uniqCombined64
Here is some comparison of the stored size and QPS https://kb.altinity.com/altinity-kb-schema-design/uniq-functions/
Approximation error is slightly different but insignificant in the most cases.
I suggest to use uniqHLL12
. It will save a ton of disk space and ultra-fast.
select countDistinct(number) from numbers(1e9);
┌─uniqExact(number)─┐
│ 1000000000 │
└───────────────────┘
1 row in set. Elapsed: 66.012 sec. Processed 1.00 billion rows, 8.00 GB (15.15 million rows/s., 121.19 MB/s.)
select uniqHLL12(number) from numbers(1e9);
┌─uniqHLL12(number)─┐
│ 1000307678 │
└───────────────────┘
1 row in set. Elapsed: 9.671 sec. Processed 1.00 billion rows, 8.00 GB (103.40 million rows/s., 827.23 MB/s.)
As you can see uniqHLL12 6 times faster (stores 1.000.0000 times less data, no joke), and has a small error -~ 0.03% (in this case, mileage can vary). With MatViews the difference is move obvious.
https://clickhouse.com/docs/en/sql-reference/data-types/aggregatefunction
To store uniqs you have to AggregateFunction and states https://altinity.com/blog/2017/7/10/clickhouse-aggregatefunctions-and-aggregatestate
https://medium.com/@f1yegor/clickhouse-aggregatefunctions-and-aggregatestate-e3fd46b7be74
https://clickhouse.com/blog/using-materialized-views-in-clickhouse
https://engineering.oden.io/blog/how-to-write-a-clickhouse-aggregate-function
CREATE TABLE IF NOT EXISTS analytics.Records
(
CidUserChecksum FixedString(24),
CidSessionChecksum FixedString(24),
IP IPv4,
PublicInstanceID String,
Created Datetime
)
ENGINE = MergeTree
ORDER BY (Created, xxHash32(PublicInstanceID))
PARTITION BY toYYYYMM(Created)
SAMPLE BY (xxHash32(PublicInstanceID));
CREATE TABLE IF NOT EXISTS analytics.VisitsHourly
(
PublicInstanceID String,
Hour DateTime,
UsersCount AggregateFunction(uniqHLL12, FixedString(24)),
SessionsCount AggregateFunction(uniqHLL12, FixedString(24)),
PageviewsCount SimpleAggregateFunction(sum,UInt64)
)
ENGINE = AggregatingMergeTree
PARTITION BY toYYYYMM(Hour)
ORDER BY (Hour, PublicInstanceID);
CREATE MATERIALIZED VIEW IF NOT EXISTS analytics.VisitsHourlyMV to analytics.VisitsHourly
AS SELECT
PublicInstanceID,
toStartOfHour(Created) as Hour,
uniqHLL12State(CidUserChecksum) as UsersCount,
uniqHLL12State(CidSessionChecksum) as SessionsCount,
count() as PageviewsCount
FROM analytics.Records
GROUP BY PublicInstanceID, toStartOfHour(Created);
insert into analytics.Records
select toFixedString(toString(cityHash64(number)),24),
toFixedString(toString(cityHash64(number+1)),24),
'0.0.0.0', number%11111, today() from numbers(10000000);
SELECT
countDistinct(CidUserChecksum) AS UsersCount,
countDistinct(CidSessionChecksum) AS SessionsCount,
count() AS PageviewsCount
FROM analytics.Records
┌─UsersCount─┬─SessionsCount─┬─PageviewsCount─┐
│ 10000000 │ 10000000 │ 10000000 │
└────────────┴───────────────┴────────────────┘
1 row in set. Elapsed: 1.593 sec. Processed 10.00 million rows, 480.01 MB (6.28 million rows/s., 301.26 MB/s.)
SELECT
uniqHLL12Merge(UsersCount) AS UsersCount,
uniqHLL12Merge(SessionsCount) AS SessionsCount,
sum(PageviewsCount) AS PageviewsCount
FROM analytics.VisitsHourly
┌─UsersCount─┬─SessionsCount─┬─PageviewsCount─┐
│ 10380374 │ 10380374 │ 10000000 │
└────────────┴───────────────┴────────────────┘
1 row in set. Elapsed: 0.197 sec. Processed 11.11 thousand rows, 4.45 MB (56.50 thousand rows/s., 22.64 MB/s.)
You need to understand the role of -State and -Merge combinators. AggregateFunction is the one of core conception of Clickhouse and very important conception. AggregateFunction is the nature of Clickhouse, distributed queries, multithread calculations are based on AggregateFunction.