Search code examples
hadoophivehdfsimpalacase-statement

Impala Query to get next date


I have 2 Impala tables.

1st table T1 (additional columns are there but I am interested in only date and day type as weekday):

date       day_type
04/01/2020 Weekday
04/02/2020 Weekday
04/03/2020 Weekday
04/04/2020 Weekend
04/05/2020 Weekend
04/06/2020 Weekday

2nd table T2:

process date       status
A       04/01/2020 finished
A       04/02/2020 finished
A       04/03/2020 finished
A       04/03/2020 run_again

Using Impala queries I have to get the maximum date from second table T2 and get its status. According to the above table 04/03 is the maximum date. If the status is finished on 04/03, then my query should return the next available weekday date from T1 which is 04/06/2020. But if the status is run_again, then the query should return the same date. In the above table, 04/03 has run_again and when my query runs the output should be 04/03/2020 and not 04/06/2020. Please note more than one status is possible for a date. For example, 04/03/2020 can have a row with finished as status and another with run again as status. In this case run again should be prioritized and the query should give 04/03/2020 as output date

What I tried so far: I ran a subquery from second table and got the maximum date and its status. I tried to run a case in my main query and gave T1 as subselect in Case statement but its not working.

Is it possible to achieve this through Impala query?


Solution

  • One way to do this is to create a CTE from table T1 instead of a correlated subquery. Something like:

    WITH T3 as (
      select t.date date, min(x.date) next_workday
      from T1 t join T1 x
      on t.date < x.date
      where x.day_type = 'Weekday'
      group by t.date
    )
    select T2.process, T2.date run_date, T2.status,
      case when T2.status = 'finished' then T3.next_workday
      else T3.date
      end next_run_date
    from T2 join T3
    on T2.date = T3.date
    order by T2.process, T2.date;
    +---------+------------+-----------+---------------+
    | process | run_date   | status    | next_run_date |
    +---------+------------+-----------+---------------+
    | A       | 2020-04-01 | finished  | 2020-04-02    |
    | A       | 2020-04-02 | finished  | 2020-04-03    |
    | A       | 2020-04-03 | run again | 2020-04-03    |
    +---------+------------+-----------+---------------+
    

    You can then select max from the result instead of ordering.