I got this tables
seen_ids
table with ids.source_ids
null engine table where data comestarget_ids
with fresh ids that was not in seen_ids on the moment when it comes.I need to store only ids that are not in seen_ids:
CREATE materialized view mv_fresh
TO target_ids
AS SELECT
*
FROM source_ids where id not in (select id from seen_ids)
Also Id
adds to seen_ids
, so then it comes next time it is not added to target_ids
.
Other option is to use LEFT JOIN, may be it is also not good idea for large table.
This is example of what I need to do, what ways do you recommend to check on "not seen recently" ids? This example is probably bad design for large seen_ids
table (50 mln rows) and large income data.
Some ways to improve perf:
CREATE materialized view mv_fresh
TO target_ids
AS SELECT
..
FROM source_ids
where id not in (
select id from seen_ids
where id IN (select id FROM source_ids))
db.seen_ids
or use skipping indexesCREATE TABLE db.seen_ids_local ON CLUSTER cl
(..)
ORDER BY id
ENGINE = MergeTree()
SETTINGS index_granularity = 256 or 512 or ..;
For example, if id
is incremental then can be applied some condition WHERE id > {some_value_that_calculated_by_MV}
Make sense to consider using a table with SET engine to store seen IDs and use it in IN operator.
It needs to
db.source_ids
For example, for 2 shards on each one be proceeded 50 / 2 = 25 mln
rows of seen_ids
table.
CREATE TABLE db.seen_ids_local ON CLUSTER cl
(..)
ORDER BY id
ENGINE = MergeTree();
/* ! Write data to distributed table to reshard them among shards. Or write data directly to shard based on sharding key. */
CREATE TABLE db.seen_ids ON CLUSTER cl AS db.seen_ids_local
ENGINE = Distributed(cl, db, seen_ids_local, intHash32(id)); /* Reshard data by 'id'. */
CREATE TABLE db.source_ids_local ON CLUSTER cl
(..)
ENGINE = Null;
/* ! Write data to distributed table to reshard them among shards. */
CREATE TABLE db.source_ids ON CLUSTER cl AS db.source_ids_local
ENGINE = Distributed(cl, db, source_ids_local, intHash32(id)); /* Reshard data by 'id'. The same sharding key as for 'seen_ids' */
CREATE TABLE db.target_ids_local ON CLUSTER cl
(..)
ENGINE = MergeTree();
CREATE materialized view db.target_ids_local_mv
TO db.target_ids_local /* [Optional] Write data to local table */
AS SELECT
..
FROM db.source_ids_local /* !! Load data from local table */
where id not in (
select id from db.seen_ids_local /* !! Load data from local table */
)
Look at these docs: Altinity Knowledge Base.