Search code examples
sqlamazon-web-servicesamazon-athenaprestotrino

Using columns from query in subquery in Athena


I have the following query

SELECT 
  t1.id,
  t1.year, 
  t1.month, 
  t1.day,
  (CASE WHEN (NOT (t1.id IN (SELECT DISTINCT t2.id
                             FROM table2 t2
                             WHERE (to_date(cast(t2.year as varchar) || LPAD(cast(t2.month as varchar), 2, '0') || LPAD(cast(t2.day as varchar), 2, '0'),'yyyyMMdd') 
                                 BETWEEN (to_date(cast(t1.year as varchar) || LPAD(cast(t1.month as varchar), 2, '0') || LPAD(cast(t1.day as varchar), 2, '0'),'yyyyMMdd') - INTERVAL  '12' MONTH) 
                                   AND to_date(cast(t1.year as varchar) || LPAD(cast(t1.month as varchar), 2, '0') || LPAD(cast(t1.day as varchar), 2, '0'),'yyyyMMdd'))))) THEN 'Something' ELSE 'Something else' END) action
FROM
table1 t1

Now the problem is I am using t1.year, t1.month, t1.day inside the subquery but athena throws me

Given correlated subquery is not supported

How can I access the columns from main query in the subquery without getting the error in Athena?


Solution

  • As mentioned in the docs correlated subqueries support is quite limited in Presto/Trino (SQL engine Athena is based on):

    Support for correlated subqueries is limited. Not every standard form is supported.

    You can try working around via join and group by. If the combination of selected fields from t1 is unique then you can follow it with group by to get the results:

    -- sample data
    WITH table_1(id, year) as (
        values (1, 1),
            (2, 2),
            (3, 3)
    ),
    table_2(id, year) as (
        values (1, 1),
            (1, 2),
            (1, 3),
            (2, -101)
    )
    
    -- query
    select t1.id, t1.year, if(max(t2.id) is not null, 'something', 'something else')
    from table_1 t1
    left join table_2 t2 on t1.id = t2.id and t2.year between t1.year - 1 and t1.year + 1
    group by t1.id, t1.year;
    

    Output:

    id year _col2
    2 2 something else
    1 1 something
    3 3 something else

    Otherwise you will need to use synthesized unique id (for example using row_number window function). Something along these lines:

    select arbitrary((t1.id, t1.year)).*, -- Trino ROW flattening trick to reduce amount of code
           if(max(t2.id) is not null, 'something', 'something else')
    from (
        select *, 
               row_number() over() as unique_id 
        from table_1) t1
    left join table_2 t2 
        on t1.id = t2.id and t2.year between t1.year - 1 and t1.year + 1
    group by t1.unique_id;