I need to do the following (in pseudo code):
where yyyy_mm_dd >= '2019-02-01'
and yyyy_mm_dd <= second highest date in a table
To achieve this, I've used this code:
where
p.yyyy_mm_dd >= "2019-02-02"
and p.yyyy_mm_dd <= (select max(yyyy_mm_dd) from schema.table1 where yyyy_mm_dd < (select max(yyyy_mm_dd) from schema.table1 where yyyy_mm_dd is not null))
The above works when it is wrapped in spark.sql(), but when I run the query without Spark, i.e. as raw HQL, I run into this error:
Error while compiling statement: FAILED: ParseException line 102:25 cannot recognize input near 'select' 'max' '(' in expression specification
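As far as I can tell, the failure comes from where the subquery sits rather than from the dates or column names: Spark SQL accepts a scalar subquery as the operand of a comparison in a WHERE clause, while Hive's parser, depending on the version, only accepts WHERE-clause subqueries in IN / NOT IN / EXISTS form. A minimal, hypothetical contrast against schema.table1:

-- Should parse in Hive 0.13 and later (uncorrelated IN subquery):
select * from schema.table1 p
where p.yyyy_mm_dd in (select max(yyyy_mm_dd) from schema.table1);

-- Typically raises the ParseException above in Hive, although Spark SQL runs it:
select * from schema.table1 p
where p.yyyy_mm_dd <= (select max(yyyy_mm_dd) from schema.table1);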
I tried to fix it by aliasing all columns in the subquery like this:
where
p.yyyy_mm_dd >= "2019-02-02"
and p.yyyy_mm_dd <= (select max(t1.yyyy_mm_dd) from schema.table1 t1 where t1.yyyy_mm_dd < (select max(t2.yyyy_mm_dd) from schema.table2 t2 where t2.yyyy_mm_dd is not null))
However, I still run into the same error.
Edit to include sample data and query:
table1:
| yyyy_mm_dd | company_id | account_manager |
|------------|------------|-----------------|
| 2020-11-10 | 321 | Peter |
| 2020-11-09 | 632 | John |
| 2020-11-08 | 598 | Doe |
| 2020-11-07 | 104 | Bob |
| ... | ... | ... |
| ... | ... | ... |
table2:
| yyyy_mm_dd | company_id | tier |
|-------------------|------------|--------|
| 2020-11-10 | 321 | Bronze |
| 2020-11-09 | 632 | Silver |
| 2020-11-08 | 598 | Gold |
| 2020-11-07 | 104 | Bob |
| ... | ... | ... |
| ... | ... | ... |
| 2019_12_13_backup | 321 | Bronze |
| 2019_12_13_backup | 632 | Silver |
| ... | | |
Query:
select
p.yyyy_mm_dd,
p.company_id,
p.account_manager,
t.tier
from
table1 p
left join(
select
yyyy_mm_dd,
company_id,
max(tier) as tier
from
table2
where
yyyy_mm_dd >= "2019-02-02"
group by
1,2
) t on (t.company_id = p.company_id and t.yyyy_mm_dd = p.yyyy_mm_dd)
where
p.yyyy_mm_dd >= "2019-02-02"
and p.yyyy_mm_dd <= (select max(yyyy_mm_dd) from table2 where yyyy_mm_dd < (select max(yyyy_mm_dd) from table2 where yyyy_mm_dd is not null))
As table2 contains backup_2019_12_31 in the yyyy_mm_dd column, those rows will be returned when doing max() on the table. So I need to get the second highest value, which from the dataset here would be 2020-11-10. There are multiple company_ids per yyyy_mm_dd.
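Since yyyy_mm_dd must be a string column to hold a value like backup_2019_12_31, that value sorts above every real yyyy-mm-dd string ('b' > '2'), which is why max() keeps returning it. A small illustration, plus a side note that simply excluding non-date values would also expose the real maximum (hypothetical queries against table2):

-- yyyy_mm_dd is a string, and 'b' sorts above any digit,
-- so the backup rows win a plain max():
select max(yyyy_mm_dd) from table2;
-- returns the backup value rather than 2020-11-10

-- Keeping only values shaped like a date exposes the real maximum:
select max(yyyy_mm_dd) from table2
where yyyy_mm_dd rlike '^\\d{4}-\\d{2}-\\d{2}$';
-- returns 2020-11-10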
In essence, I want to query table1 where yyyy_mm_dd is between the starting point (hardcoded as 2019-02-02) and the true max date from table2.
To get the second highest date from table2 you can use dense_rank(). All rows with the second highest date will be assigned rn=2. Use LIMIT 1 to get a single row, or use max() or a distinct aggregation for the same result, then cross join your table with max_date and filter.
with max_date as (
    select yyyy_mm_dd
    from
    (
        select yyyy_mm_dd,
               dense_rank() over (order by yyyy_mm_dd desc) rn
        from table2
    ) s
    where rn = 2 -- second highest date
    limit 1      -- need only one record
)
select t1.*
from table1 t1
cross join max_date t2
where t1.yyyy_mm_dd <= t2.yyyy_mm_dd
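If you prefer the max() route over LIMIT 1 and want to fold this back into the original query, keeping the hardcoded lower bound and the left join from the question, a sketch could look like this (the CTE names ranked and max_date are just illustrative):

with ranked as (
    select yyyy_mm_dd,
           dense_rank() over (order by yyyy_mm_dd desc) as rn
    from table2
),
max_date as (
    -- max() instead of LIMIT 1: collapses all rn = 2 rows to a single value
    select max(yyyy_mm_dd) as yyyy_mm_dd
    from ranked
    where rn = 2
)
select
    p.yyyy_mm_dd,
    p.company_id,
    p.account_manager,
    t.tier
from table1 p
cross join max_date m
left join (
    select yyyy_mm_dd, company_id, max(tier) as tier
    from table2
    where yyyy_mm_dd >= '2019-02-02'
    group by 1, 2
) t on (t.company_id = p.company_id and t.yyyy_mm_dd = p.yyyy_mm_dd)
where p.yyyy_mm_dd >= '2019-02-02'
  and p.yyyy_mm_dd <= m.yyyy_mm_dd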