Search code examples
sqlhiveqlgreatest-n-per-groupgaps-and-islands

How would I extract only the latest week from a select over statement in Hiveql?


I need some help, I've created a query which keeps a running total of whether an element returns a 1 or 0 against a specific measure with the running total returning to 0 if the measure provides a 0, Example below:

year_week  element  measure  running_total
2020_40    A        1        1
2020_41    A        1        2
2020_42    A        1        3
2020_43    A        0        0
2020_44    A        1        1
2020_45    A        1        2
2020_40    B        1        1
2020_41    B        1        2
2020_42    B        1        3
2020_43    B        1        4
2020_44    B        1        5
2020_45    B        1        6

The above is achieved using this query:

SELECT element,
       year_week,
       measure,
       SUM(measure) OVER (PARTITION BY element, flag_sum ORDER BY year_week ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
    FROM (
      SELECT *,
          SUM(measure_flag) OVER (PARTITION BY element ORDER BY year_week ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS flag_sum
      FROM (
        SELECT *,
            CASE WHEN  measure = 1 THEN 0 ELSE 1 END AS measure_flag
        FROM database.table ) x ) y

This is great and works - but I'd like to provide only the latest weeks data for each element. So in the above example it would be:

year_week  element  measure  running_total
2020_45    A        1        2
2020_45    B        1        6

Essentially I need to keep the logic the same but limit the returned data set. I've attempted this however it changes the result from the correct running total to a 1 or 0.

Any help is greatly appreciated!


Solution

  • You can add another level of nesting, and filter the latest record per element with row_number().

    I would suggest:

    select element, year_week, measure, running_total
    from (
        select t.*,
            row_number() over(partition by element, grp order by year_week) - 1 as running_total
        from (
            select t.*,
                sum(1 - measure) over(partition by element order by year_week) as grp,
                row_number() over(partition by element order by year_week desc) as rn
            from mytable t
        ) t
    ) t
    where rn = 1
    

    I simplified the query a little, considering the fact that measure has values 0 and 1 only, as showed in your sample data. If that's not the case, then:

    select element, year_week, measure, running_total
    from (
        select t.*,
            sum(measure) over(partition by element, grp order by year_week) as running_total
        from (
            select t.*,
                sum(case when measure = 0 then 1 else 0 end) over(partition by element order by year_week) as grp,
                row_number() over(partition by element order by year_week desc) as rn
            from mytable t
        ) t
    ) t
    where rn = 1