Search code examples
sqlclickhousecorrelated-subquery

Outer Query Column cannot be used in Clickhouse SELECT


I'm trying to select rows with the max value based on a metric. So, each row should be "compared" to its own max value based on the metric of that row. This can be done with the following query in PostgreSQL: This SQL runs correctly in PostgreSQL:

DROP TABLE IF EXISTS test_max_date;
CREATE TABLE IF NOT EXISTS test_max_date
(
    timestamp   date,
    value       double precision,
    metric      text,
    accept_date date
);
INSERT INTO test_max_date (timestamp, value, metric, accept_date)
VALUES ('1990-01-01', 1.1, 'foo', '1999-02-01'), -- Set 1
       ('1990-02-01', 2.2, 'foo', '1999-02-01'),
       ('1990-03-01', 3.3, 'foo', '1999-02-01'),
       ('1990-01-01', 1.2, 'bar', '2021-02-01'), -- Set 2
       ('1990-02-01', 2.3, 'bar', '2021-02-01'),
       ('1990-03-01', 3.4, 'bar', '2021-02-01'),
       ('1990-01-01', 1.1, 'foo', '1999-04-01'), -- Set 3
       ('1990-02-01', 2.2, 'foo', '1999-04-01'),
       ('1990-03-01', 3.3, 'foo', '1999-04-01'),
       ('1990-01-01', 1.2, 'bar', '2022-02-01'), -- Set 4
       ('1990-02-01', 2.3, 'bar', '2022-02-01'),
       ('1990-03-01', 3.4, 'bar', '2022-02-01')
;

SELECT timestamp, value, metric, accept_date
FROM test_max_date tmd1
WHERE
  accept_date = (SELECT max(accept_date) FROM test_max_date  tmd2 WHERE tmd2.metric = tmd1.metric)

But, the outer query column cannot be used in the inner query WHERE clause in Clickhouse. The exact same SQL does not run in Clickhouse:

DROP TABLE IF EXISTS test_max_date;
CREATE TABLE IF NOT EXISTS test_max_date
(
    timestamp   date,
    value       double precision,
    metric      text,
    accept_date date
)
    ENGINE = ReplacingMergeTree
ORDER BY (metric, accept_date, timestamp);
INSERT INTO test_max_date (timestamp, value, metric, accept_date)
VALUES ('1990-01-01', 1.1, 'foo', '1999-02-01'),
       ('1990-02-01', 2.2, 'foo', '1999-02-01'),
       ('1990-03-01', 3.3, 'foo', '1999-02-01'),
       ('1990-01-01', 1.2, 'bar', '2021-02-01'),
       ('1990-02-01', 2.3, 'bar', '2021-02-01'),
       ('1990-03-01', 3.4, 'bar', '2021-02-01'),
       ('1990-01-01', 1.1, 'foo', '1999-04-01'),
       ('1990-02-01', 2.2, 'foo', '1999-04-01'),
       ('1990-03-01', 3.3, 'foo', '1999-04-01'),
       ('1990-01-01', 1.2, 'bar', '2022-02-01'),
       ('1990-02-01', 2.3, 'bar', '2022-02-01'),
       ('1990-03-01', 3.4, 'bar', '2022-02-01')
;

SELECT timestamp, value, metric, accept_date
FROM test_max_date tmd1
WHERE
  accept_date = (SELECT max(accept_date) FROM test_max_date  tmd2 WHERE tmd2.metric = tmd1.metric)

The error is:

Code: 47. DB::Exception: Missing columns: 'tmd1.metric' while processing query: 'SELECT max(accept_date) FROM test_max_date AS tmd2 WHERE metric = tmd1.metric', required columns: 'metric' 'tmd1.metric' 'accept_date', maybe you meant: ['metric','accept_date']: While processing (SELECT max(accept_date) FROM test_max_date AS tmd2 WHERE tmd2.metric = tmd1.metric) AS _subquery805: While processing accept_date = ((SELECT max(accept_date) FROM test_max_date AS tmd2 WHERE tmd2.metric = tmd1.metric) AS _subquery805). (UNKNOWN_IDENTIFIER) (version 22.12.1.1752 (official build)) , server ClickHouseNode [uri=http://localhost:8123/default, options={custom_http_params=session_id=DataGrip_d050598f-af8a-4c56-b9d4-e88fbec47962}]@-1760860067

I assume there's something different between the two, but I was unable to find anything about outer query column access in clickhouse.


Solution

  • 1)
    SELECT timestamp, value, metric, accept_date
    FROM test_max_date tmd1
    WHERE (accept_date,metric) in (SELECT max(accept_date),metric FROM test_max_date group by metric)
    order by 1,2
    ┌──timestamp─┬─value─┬─metric─┬─accept_date─┐
    │ 1990-01-01 │   1.1 │ foo    │  1999-04-01 │
    │ 1990-01-01 │   1.2 │ bar    │  2022-02-01 │
    │ 1990-02-01 │   2.2 │ foo    │  1999-04-01 │
    │ 1990-02-01 │   2.3 │ bar    │  2022-02-01 │
    │ 1990-03-01 │   3.3 │ foo    │  1999-04-01 │
    │ 1990-03-01 │   3.4 │ bar    │  2022-02-01 │
    └────────────┴───────┴────────┴─────────────┘
    
    2)
    SELECT timestamp, value, metric, accept_date 
    from ( SELECT timestamp, value, metric, accept_date,
              rank() over (partition by metric order by accept_date desc) r
          FROM test_max_date tmd1 ) t
    WHERE r=1
    order by 1,2
    ┌──timestamp─┬─value─┬─metric─┬─accept_date─┐
    │ 1990-01-01 │   1.1 │ foo    │  1999-04-01 │
    │ 1990-01-01 │   1.2 │ bar    │  2022-02-01 │
    │ 1990-02-01 │   2.2 │ foo    │  1999-04-01 │
    │ 1990-02-01 │   2.3 │ bar    │  2022-02-01 │
    │ 1990-03-01 │   3.3 │ foo    │  1999-04-01 │
    │ 1990-03-01 │   3.4 │ bar    │  2022-02-01 │
    └────────────┴───────┴────────┴─────────────┘
    
    3)
    select timestamp, value, metric, accept_date
    from (
        SELECT metric, accept_date, groupArray( (timestamp, value) ) ga
        FROM test_max_date
        group by metric, accept_date
        order by metric, accept_date desc 
        limit 1 by metric
        ) array join ga.1 as timestamp, ga.2 as value
    order by 1, 2
    ┌──timestamp─┬─value─┬─metric─┬─accept_date─┐
    │ 1990-01-01 │   1.1 │ foo    │  1999-04-01 │
    │ 1990-01-01 │   1.2 │ bar    │  2022-02-01 │
    │ 1990-02-01 │   2.2 │ foo    │  1999-04-01 │
    │ 1990-02-01 │   2.3 │ bar    │  2022-02-01 │
    │ 1990-03-01 │   3.3 │ foo    │  1999-04-01 │
    │ 1990-03-01 │   3.4 │ bar    │  2022-02-01 │
    └────────────┴───────┴────────┴─────────────┘