Search code examples
sqlapache-spark-sql

How can i bring a previous value over a group using lag?


I have the following data:

ID Date
1 2023-10-05
2 2023-10-05
3 2023-10-05
4 2023-10-05
1 2023-10-10
2 2023-10-10
3 2023-10-10
4 2023-10-10

What i'd like to see is

ID Date Previous Date
1 2023-10-05 null
2 2023-10-05 null
3 2023-10-05 null
4 2023-10-05 null
1 2023-10-10 2023-10-05
2 2023-10-10 2023-10-05
3 2023-10-10 2023-10-05
4 2023-10-10 2023-10-05

This is what I've tried:

select ID, Date, lag(Date) over (partition by Date order by Date)
from Table;

But the result just shows the date of the preceeding row, not the preceeding group (indicated in the PARTITION BY portion), so i get this result:

ID Date Previous Date
1 2023-10-05 null
2 2023-10-05 2023-10-05
3 2023-10-05 2023-10-05
4 2023-10-05 2023-10-05
1 2023-10-10 2023-10-05
2 2023-10-10 2023-10-10
3 2023-10-10 2023-10-10
4 2023-10-10 2023-10-10

How can I show the previous date by group ignoring ID in the same query without the need of subqueries or joins using SQL?

EDIT *********

I tried @The Impaler's answer below, it is working and bringing the correct values for most of the rows, however this is where it's failing:

Let's say we have ID from 1 to 4 for date 2023-10-05, and then for the next date 2023-10-10 we have ID from 1 to 6, so the partition by ID will only work if ID values repeat between month, but not for new IDs in that month not present in the previous month:

ID Date Previous Date
1 2023-10-05 null
2 2023-10-05 null
3 2023-10-05 null
4 2023-10-05 null
1 2023-10-10 2023-10-05
2 2023-10-10 2023-10-05
3 2023-10-10 2023-10-05
4 2023-10-10 2023-10-05
5 2023-10-10 null
6 2023-10-10 null

Thanks


Solution

  • From the apache documentation it seems this is supported by the base dialect. I use a platform built on spark with additional bells and whistles so not 100% sure.

    https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-window.html

    last_distinct_date_value gives how I'm interpreting the question.

    last_date_by_id is the existing answer for comparison.

    with some_sample_data as (
    SELECT data.id, _date
        FROM VALUES (1, cast('2023-10-05' as date)),
                    (2, cast('2023-10-05' as date)),
                    (3, cast('2023-10-05' as date)),
                    (4, cast('2023-10-05' as date)),
                    (1, cast('2023-10-09' as date)),
                    (1, cast('2023-10-10' as date)),
                    (2, cast('2023-10-10' as date)),
                    (3, cast('2023-10-10' as date)),
                    (4, cast('2023-10-10' as date)),
                    (5, cast('2023-10-10' as date)),
                    (6, cast('2023-10-10' as date)),
                    (7, cast('2023-10-15' as date))
                     AS data(id, _date)
    )
    select id,
           _date,
           max(_date) over(
             partition by 1
                 order by _date
             range between unbounded preceding and 1 preceding
           ) as last_distinct_date_value,
           lag(_date) over(
             partition by id
                 order by _date
           ) as last_date_by_id
      from some_sample_data
    
    
    
    id  _date       last_distinct_date_value    last_date_by_id
    1   2023-10-05  -                           -
    2   2023-10-05  -                           -
    3   2023-10-05  -                           -
    4   2023-10-05  -                           -
    1   2023-10-09  2023-10-05                  2023-10-05
    1   2023-10-10  2023-10-09                  2023-10-09
    2   2023-10-10  2023-10-09                  2023-10-05
    3   2023-10-10  2023-10-09                  2023-10-05
    4   2023-10-10  2023-10-09                  2023-10-05
    5   2023-10-10  2023-10-09                  -
    6   2023-10-10  2023-10-09                  -
    7   2023-10-15  2023-10-10                  -