Search code examples
mysqlhivequery-optimizationhiveqlimpala

How to retrieve the values of the table for the max value of two columns in sql


I am trying to query a table by taking the maximum values from two different date columns, and output's all the records that have maximum of both the dates

The table has 6 columns which include st_id(string)(there are multiple entries of the same id), as_of_dt(int) and ld_dt_ts(timestamp). From this table, I am trying to get the max value of as_of_dt and ld_dt_ts and group by st_id and display all the records.

This works perfectly, but its not really optimal

SELECT A.st_id, A.fl_vw, A.tr_record FROM db.tablename A 
INNER JOIN (
    SELECT st_id, max(as_of_dt) AS as_of_dt, max(ld_dt_ts) AS ld_dt_ts 
    From db.tablename 
    group by st_id
) B on A.st_id = B.st_id and A.as_of_dt = B.as_of_dt and A.ld_dt_ts= B.ld_dt_ts

--

The expected result should return the st_id that has the maximum of both as_of_dt and ld_dt_ts i.e., which will be the latest record for each st_id.


Solution

  • Use analytic rank() function. rank() will assign 1 to all records with max date in the st_id partition:

    SELECT s.st_id, s.fl_vw, s.tr_record
    from
    (
    SELECT A.st_id, A.fl_vw, A.tr_record,
           rank() over(partition by st_id order by as_of_dt desc) rnk_as_of_dt,
           rank() over(partition by st_id order by ld_dt_ts desc) rnk_ld_dt_tsrnk
      FROM db.tablename A 
    )s 
    WHERE rnk_as_of_dt=1 ANDrnk=1 rnk_ld_dt_ts=1 --get records with max dates in both columns
    

    Two ranks can be possibly combined like this:

    SELECT s.st_id, s.fl_vw, s.tr_record
    from
    (
    SELECT A.st_id, A.fl_vw, A.tr_record,
           rank() over(partition by st_id order by as_of_dt desc, ld_dt_ts desc) rnk
      FROM db.tablename A 
    )s 
    WHERE rnk=1  --get records with max dates combination 
    

    but this is not exactly the same as your original query. For example if you have this dataset:

    st_id, as_of_dt, ld_dt_ts 
    1       1         2
    1       2         1
    

    Then this query

    SELECT st_id, max(as_of_dt) AS as_of_dt, max(ld_dt_ts) AS ld_dt_ts 
        From db.tablename 
        group by st_id
    

    will return:

    st_id, as_of_dt, ld_dt_ts 
    1       2         2
    

    And final join will not return any rows, because no any row with such combination exist, while query with two ranks combined will return:

    st_id, as_of_dt, ld_dt_ts 
    1       2         1
    

    If such dataset can not exist in your data (say, ld_dt_ts is always>=as_of_dt), then you possibly can combine ranks into single one or even use only one date in the order by.