Search code examples
sqlnetezzalead

SQL lead function based on two columns


I have a table with about 700 million rows, which have the below sample for only one line_id.

LINE_ID|COLLECTION_DATE    |DSL_CARD_TYPE|
-------|-------------------|-------------|
1234567|2020-03-25 08:46:08|ADSL_PORT    |
1234567|2020-03-26 08:31:48|ADSL_PORT    |
1234567|2020-03-27 08:42:40|VDSL_PORT    |
1234567|2020-03-28 08:36:32|VDSL_PORT    |
1234567|2020-03-29 08:31:33|VDSL_PORT    |
1234567|2020-03-30 08:50:15|VDSL_PORT    |
1234567|2020-04-31 08:44:33|ADSL_PORT    |
1234567|2020-03-01 08:34:53|ADSL_PORT    |
1234567|2020-04-02 08:44:11|ADSL_PORT    |
1234567|2020-04-03 08:43:51|VDSL_PORT    |
1234567|2020-04-04 08:54:33|ADSL_PORT    |
1234567|2020-04-05 09:06:47|ADSL_PORT    |
1234567|2020-04-06 09:06:57|VDSL_PORT    |
1234567|2020-04-07 09:13:32|VDSL_PORT    |

What I need is to group DSL_CARD_TYPE and create a new column called Next_COLLECTION_DATE to get the next DSL_CARD_TYPE like below

LINE_ID|COLLECTION_DATE    |Next_COLLECTION_DATE  |DSL_CARD_TYPE|
-------|-------------------|----------------------|-------------|
1234567|2020-03-25 08:46:08|2020-03-26 08:31:48   |ADSL_PORT    |  
1234567|2020-03-27 08:42:40|2020-03-30 08:50:15   |VDSL_PORT    |
1234567|2020-03-31 08:34:53|2020-04-02 08:44:11   |ADSL_PORT    |
1234567|2020-04-03 08:43:51|2020-04-03 08:43:51   |VDSL_PORT    |   
1234567|2020-04-04 08:54:33|2020-04-05 09:06:47   |ADSL_PORT    |  
1234567|2020-04-06 09:06:57|2020-04-07 09:13:32   |VDSL_PORT    | 
  

I have created a very dummy and complex query to do the job, but with this huge amount of data it takes hours

COALESCE (lead (COLLECTION_DATE) OVER (PARTITION BY Line_ID ORDER BY COLLECTION_DATE),NOW() )Next_Collection_Date,
DSL_CARD_TYPE 
FROM (
SELECT * FROM (
SELECT
    LINE_ID, COLLECTION_DATE, 
    DSL_CARD_TYPE , 
    lead (DSL_CARD_TYPE) OVER (PARTITION BY Line_ID ORDER BY COLLECTION_DATE) To_Sync_Port, 
    lag (DSL_CARD_TYPE) OVER (PARTITION BY Line_ID ORDER BY COLLECTION_DATE) B_Sync_Port
FROM
    ANALYTICS.tmp.V_PORTS_LINE_CARD_DATA_ALL
    WHERE SYNC_PORT <>  TO_SYNC_PORT OR B_Sync_Port IS NULL )abc2```

Solution

  • This looks like a gaps-and-islands problem, which in this case is probably best solved using the difference of row numbers:

    select line_id, dsl_card_type, min(collection_date), max(collection_date)
    from (select v.*,
                 row_number() over (partition by line_id order by collection_date) as seqnum,
                 row_number() over (partition by line_id, dsl_card_type order by collection_date) as seqnum_2
          from ANALYTICS.tmp.V_PORTS_LINE_CARD_DATA_ALL v
          where collection_date >= '2020-07-27 00:00:00'
         ) v
    group by line_id, dsl_card_type, (seqnum - seqnum_2);
    

    It is a little tricky to explain how this works. If you run the subquery, you can see how the difference between the two row numbers defines the adjacent rows with the same card type.