I have 2 Impala tables.
1st table T1 (additional columns are there but I am interested in only date and day type as weekday):
date day_type
04/01/2020 Weekday
04/02/2020 Weekday
04/03/2020 Weekday
04/04/2020 Weekend
04/05/2020 Weekend
04/06/2020 Weekday
2nd table T2:
process date status
A 04/01/2020 finished
A 04/02/2020 finished
A 04/03/2020 finished
A 04/03/2020 run_again
Using Impala queries I have to get the maximum date from second table T2 and get its status. According to the above table 04/03 is the maximum date.
If the status is finished on 04/03, then my query should return the next available weekday date from T1 which is 04/06/2020.
But if the status is run_again
, then the query should return the same date.
In the above table, 04/03 has run_again
and when my query runs the output should be 04/03/2020 and not 04/06/2020.
Please note more than one status is possible for a date. For example, 04/03/2020 can have a row with finished as status and another with run again as status. In this case run again should be prioritized and the query should give 04/03/2020 as output date
What I tried so far: I ran a subquery from second table and got the maximum date and its status. I tried to run a case in my main query and gave T1 as subselect in Case statement but its not working.
Is it possible to achieve this through Impala query?
One way to do this is to create a CTE from table T1 instead of a correlated subquery. Something like:
WITH T3 as (
select t.date date, min(x.date) next_workday
from T1 t join T1 x
on t.date < x.date
where x.day_type = 'Weekday'
group by t.date
)
select T2.process, T2.date run_date, T2.status,
case when T2.status = 'finished' then T3.next_workday
else T3.date
end next_run_date
from T2 join T3
on T2.date = T3.date
order by T2.process, T2.date;
+---------+------------+-----------+---------------+
| process | run_date | status | next_run_date |
+---------+------------+-----------+---------------+
| A | 2020-04-01 | finished | 2020-04-02 |
| A | 2020-04-02 | finished | 2020-04-03 |
| A | 2020-04-03 | run again | 2020-04-03 |
+---------+------------+-----------+---------------+
You can then select max from the result instead of ordering.