Search code examples
sqlsql-likesamplingclickhouse

Clickhouse: is there a way to set up sampling on an existing MergeTree that doesnt have a primary key in clickhouse?


I want to do random sampling on this table I have, it contains data that registers users' session activity. So there is a userid, a sessionid, timestamps also other general data. Because of how it was organized, there isn't a single column that is unique since a user can have multiple sessions or a session can be accessed by multiple users, etc.

there are around 300M rows in this table and I wanted to be able to sample it. However, looking at the documentation, it looks like I have to set a hashing key or a sampling key, is it possible to do that when my columns are unique? or is it possible to hash it over multiple columns?

I have a query such as

select * from table order by rand() limit (select toInt32(count(*)/100) from table)

but it's too demanding to run over the 300M rows


Solution

  • it is possible to hash it over multiple columns

    SELECT cityHash64('axx', 20)
    
    ┌─cityHash64('axx', 20)─┐
    │   6142436227093532237 │
    └───────────────────────┘
    

    Sampling is hard thing to implement properly, please read this https://kb.altinity.com/altinity-kb-queries-and-syntax/sampling-example