I am new to SQL and ClickHouse especially. I am trying to join 2 tables based on customer id. The problem is the type of data in 2 tables that I am joining are different. The left table's column that i join on has data a type of string and the right table is int64 or Nullable. When I try to convert left table's customer ids with the type of string to int64 I get a timeout error. It seems that the server cannot parse this much data (even though I limit the data by specifying the date range)
Here's my query that results in timeout:
WITH
t AS (
SELECT ticketId,
toInt64(accId) AS accountId,
datetimeCreated,
os
FROM(
SELECT
ticketId,
accId AS accountId,
tsToDateTime(entryCreateTime) AS datetimeCreated,
-- tsToDateTime is a custom function that parses timestamp into the date and time that human beings are used to
os
FROM myTab.table
PREWHERE
datetimeCreated BETWEEN toDateTime(today()) - INTERVAL 5 MONTH AND datetimeCreated AND
status = 'new' AND
role = 'end-user' AND
project = 'pj' AND
replaceRegexpAll(accountId,'\\D', '') = accountId --need this line to exclude rubbish ids that have anything but digits
GROUP BY
ticketId,
accId,
datetimeCreated,
os
)
),
p AS (
SELECT
tsToDateTime(ts) AS datetimePaid, accId,
amount
FROM otherTab.tab2
PREWHERE
datetimePaid >= today() - 365
)
SELECT ticketId, acctId, datetimeCreated,
COUNT(amount) AS payms_cnt,
ROUND(SUM(amount)) AS payms_sum
FROM (
SELECT
t.ticketId,
t.accId,
t.datetimeCreated,
t.platform,
p.datetimePaid,
p.amount
FROM t
INNER JOIN p ON t.accId = p.accId
WHERE p.datetimePaid BETWEEN t.datetimeCreated - INTERVAL 4 MONTH AND t.datetimeCreated)
GROUP BY accId, ticketId, datetimeCreated, platform
ORDER BY accId, datetimeCreated
;
I think I should also mention that the tables I am joining have around several tens of thousands of lines considering the daterange filter.
I have tried transforming accId from the right table (the p table) to string, however the result was the same (timeout). I checked these 2 table (p and t) separately and they work fine until I try joining them. I have also tried
FROM t
INNER JOIN p ON CAST(t.accId AS int) = p.accId
but still receive timeout:
SQL Error [159] [07000]: Read timed out, server ClickHouseNode
Does your data small after filter PREWHERE datetimePaid >= today() - 365
or WHERE p.datetimePaid BETWEEN t.datetimeCreated - INTERVAL 4 MONTH AND t.datetimeCreated
? The latter can't be pushed down through the join since it depends on values from both tables, so it's executed after the join.
In this case, the bottleneck is the size of p
. You can check which filter is pushed down by using EXPLAIN PLAN actions = 1
. Read it from bottom to top and look for Prewhere filter column
or Filter column
to see which filters are executed before the join.
Also if t
is smaller than p
I can suggest rewriting query to have smaller table in the right, it's not done automatically.
You mentioned a timeout — is this a client-side timeout? For large joins, ClickHouse usually returns a MEMORY_LIMIT_EXCEPTION
since it loads the right table into memory.
And you may consider using a different join algorithm that can save some data to disk, though it's more of a mitigation for memory issues: https://clickhouse.com/blog/clickhouse-fully-supports-joins-how-to-choose-the-right-algorithm-part5