Search code examples
sqlquery-optimizationclickhouse

How can I join 2 tables together if columns I join on are of different data type?


I am new to SQL and ClickHouse especially. I am trying to join 2 tables based on customer id. The problem is the type of data in 2 tables that I am joining are different. The left table's column that i join on has data a type of string and the right table is int64 or Nullable. When I try to convert left table's customer ids with the type of string to int64 I get a timeout error. It seems that the server cannot parse this much data (even though I limit the data by specifying the date range)

Here's my query that results in timeout:

WITH
    t AS (
    SELECT ticketId, 
        toInt64(accId) AS accountId,
        datetimeCreated,
        os
    FROM(
        SELECT
            ticketId, 
            accId AS accountId,
            tsToDateTime(entryCreateTime) AS datetimeCreated, 
-- tsToDateTime is a custom function that parses timestamp into the date and time that human beings are used to
            os
        FROM myTab.table 
        PREWHERE 
            datetimeCreated BETWEEN toDateTime(today()) - INTERVAL 5 MONTH AND datetimeCreated AND 
            status = 'new' AND 
            role = 'end-user' AND
            project = 'pj' AND 
            replaceRegexpAll(accountId,'\\D', '') = accountId --need this line to exclude rubbish ids that have anything but digits
        GROUP BY 
            ticketId, 
            accId,
            datetimeCreated,
            os
)
        ),
    p AS (
    SELECT 
        tsToDateTime(ts) AS datetimePaid, accId,
        amount
    FROM otherTab.tab2  
    PREWHERE 
        datetimePaid >= today() - 365
    )
SELECT ticketId, acctId, datetimeCreated,
COUNT(amount) AS payms_cnt,
ROUND(SUM(amount)) AS payms_sum
FROM (
    SELECT 
        t.ticketId, 
        t.accId, 
        t.datetimeCreated,
        t.platform,
        p.datetimePaid,
        p.amount
FROM t
INNER JOIN p ON t.accId = p.accId
WHERE p.datetimePaid BETWEEN t.datetimeCreated - INTERVAL 4 MONTH AND t.datetimeCreated)
GROUP BY accId, ticketId, datetimeCreated, platform
ORDER BY accId, datetimeCreated 
;

I think I should also mention that the tables I am joining have around several tens of thousands of lines considering the daterange filter.

I have tried transforming accId from the right table (the p table) to string, however the result was the same (timeout). I checked these 2 table (p and t) separately and they work fine until I try joining them. I have also tried

FROM t
INNER JOIN p ON CAST(t.accId AS int) = p.accId

but still receive timeout:

SQL Error [159] [07000]: Read timed out, server ClickHouseNode

Solution

  • Does your data small after filter PREWHERE datetimePaid >= today() - 365 or WHERE p.datetimePaid BETWEEN t.datetimeCreated - INTERVAL 4 MONTH AND t.datetimeCreated? The latter can't be pushed down through the join since it depends on values from both tables, so it's executed after the join.

    In this case, the bottleneck is the size of p. You can check which filter is pushed down by using EXPLAIN PLAN actions = 1. Read it from bottom to top and look for Prewhere filter column or Filter column to see which filters are executed before the join.

    Also if t is smaller than p I can suggest rewriting query to have smaller table in the right, it's not done automatically.

    You mentioned a timeout — is this a client-side timeout? For large joins, ClickHouse usually returns a MEMORY_LIMIT_EXCEPTION since it loads the right table into memory.

    And you may consider using a different join algorithm that can save some data to disk, though it's more of a mitigation for memory issues: https://clickhouse.com/blog/clickhouse-fully-supports-joins-how-to-choose-the-right-algorithm-part5