I want to do random sampling on this table I have, it contains data that registers users' session activity. So there is a userid, a sessionid, timestamps also other general data. Because of how it was organized, there isn't a single column that is unique since a user can have multiple sessions or a session can be accessed by multiple users, etc.
there are around 300M rows in this table and I wanted to be able to sample it. However, looking at the documentation, it looks like I have to set a hashing key or a sampling key, is it possible to do that when my columns are unique? or is it possible to hash it over multiple columns?
I have a query such as
select * from table order by rand() limit (select toInt32(count(*)/100) from table)
but it's too demanding to run over the 300M rows
it is possible to hash it over multiple columns
SELECT cityHash64('axx', 20)
┌─cityHash64('axx', 20)─┐
│ 6142436227093532237 │
└───────────────────────┘
Sampling is hard thing to implement properly, please read this https://kb.altinity.com/altinity-kb-queries-and-syntax/sampling-example