sql postgresql apache-spark window-functions gaps-and-islands

Postgresql - Find start date and end date for continuous status

here is my dataset

processed_on|status |
------------+-------+
2023-01-01 |Success|
2023-01-02 |Success|
2023-01-03 |Success|
2023-01-04 |Fail   |
2023-01-05 |Fail   |
2023-01-06 |Success|
2023-01-07 |Fail   |
2023-01-08 |Success|
2023-01-09 |Success|

expected output is

--------------------------
start_date|end_date|status
--------------------------
2023-01-01|2023-01-03|Success
2023-01-04|2023-01-05|Fail
2023-01-06|2023-01-06|Success
2023-01-07|023-01-07|Fail
2023-01-08|2023-01-09|Success

i tried window function like led and lag .but no luck..

Solution

You can use SQL window function to get the result you want in a few steps.

Step 1. First use window function to compare previous status and current status, if they are different then mark it as status bucket starting point.

Step 2. Then use window function to count how many status_bucket_starting_point we have reach so far, define it as status bucket number.

Step 3. At last, group by stauts bucket number and use min()/max() function to get start date and end date of the bucket,

Here is the PostgreSQL query:

with data_with_bucket_start AS (
select
    processed_on,
    status,
    case when lag(status) over (order by processed_on) = status then 0 else 1 end as status_bucket_start   
from
    data_table
),
data_with_bucket AS (
select
    processed_on,
    status,
    sum(status_bucket_start) over (order by processed_on) as status_bucket
from 
    data_with_bucket_start
)
select
    min(processed_on) as start_date,
    max(processed_on) as end_date,
    max(status) as stauts
from
    data_with_bucket
group by
    status_bucket
order by
    status_bucket

start_date	end_date	stauts
2023-01-01	2023-01-03	Success
2023-01-04	2023-01-05	Fail
2023-01-06	2023-01-06	Success
2023-01-07	2023-01-07	Fail
2023-01-08	2023-01-09	Success