Search code examples
sqlapache-sparkamazon-athenaprestotrino

FIRST_VALUE in Athena or Spark


select id
      ,id2
      ,FIRST_VALUE(CASE WHEN app THEN date0 ELSE NULL END) IGNORE NULLS  OVER (PARTITION BY id ORDER BY date0) as date_result
from (
select 1 id, 22 as id2, false app, Date'2019-03-13' as date0
union
select 1 id, 23 as id2, true app, Date'2019-03-14' as date0
union
select 1 id, 23 as id2, true app, Date'2019-03-15' as date0
)

Above query is returning like below in Athena

id id2 date_result
1 22
1 23 2019-03-14
1 23 2019-03-14

But I was expecting like below since we do ignore nulls and partition by id for date_result

id id2 date_result
1 22 2019-03-14
1 23 2019-03-14
1 23 2019-03-14

Could you please let me know what I am doing wrong in first_value? what is the best way to achieve this result in both Athena and spark? Thanks

I have added it in the description


Solution

  • Could you please let me know what I am doing wrong in first_value?

    default frame for windows functions is unbounded preceding - current row:

    If frame_end is not specified, a default value of CURRENT ROW is used.

    If no frame is specified, a default frame of RANGE UNBOUNDED PRECEDING is used.

    If you want to find value across the whole partition you need to specify the frame, for example:

    with dataset(id, id2, app, date0) as (
        values (1, 22, false, Date'2019-03-13'),
        (1, 23, true ,Date'2019-03-14'),
        (1, 23, true ,Date'2019-03-15')
    )
    
    select id
         , id2
         , FIRST_VALUE(if(app, date0)) IGNORE NULLS
            OVER (PARTITION BY id ORDER BY date0 RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as date_result
    from dataset;