Search code examples
sqlimpalahue

Impala SQL: Merging rows with overlapping dates. WHERE EXISTS and recursive CTE not supported


I am trying to merge rows with overlapping date intervals in a table in Impala SQL. However the solutions I have found to solve this are not supported by Impala eg. WHERE EXISTS and recursive CTEs.

How would I write a query for this in Impala?

    Table: @T
    ID  StartDate   EndDate
    1   20170101    20170201
    2   20170101    20170401
    3   20170505    20170531    
    4   20170530    20170531
    5   20170530    20170831
    6   20171001    20171005
    7   20171101    20171225
    8   20171105    20171110

    Required Output:
    StartDate   EndDate
    20170101    20170401
    20170505    20170831
    20171001    20171005

Example of what I am trying to achieve that is not supported in Impala:

    SELECT 
           s1.StartDate,
           MIN(t1.EndDate) AS EndDate
    FROM @T s1 
    INNER JOIN @T t1 ON s1.StartDate <= t1.EndDate
      AND NOT EXISTS(SELECT * FROM @T t2 
             WHERE t1.EndDate >= t2.StartDate AND t1.EndDate < t2.EndDate) 
    WHERE NOT EXISTS(SELECT * FROM @T s2 
                     WHERE s1.StartDate > s2.StartDate AND s1.StartDate <= s2.EndDate) 
    GROUP BY s1.StartDate 
    ORDER BY s1.StartDate 

Similar questions:

Merge overlapping date intervals

Eliminate and reduce overlapping date ranges

https://gerireshef.wordpress.com/2010/05/02/packing-date-intervals/

https://www.sqlservercentral.com/Forums/Topic826031-8-1.aspx


Solution

  • select  min(StartDate)  as StartDate
           ,max(EndDate)    as EndDate
    
    from   (select  StartDate,EndDate
                   ,count (is_gap) over
                    (
                        order by    StartDate,ID
                    )   as range_id
    
            from   (select  ID,StartDate,EndDate
                           ,case 
                                when    max (EndDate) over
                                        (
                                            order by    StartDate,ID
                                            rows        between unbounded preceding 
                                                        and     1 preceding
                                        ) < StartDate
                                then    true
                            end as is_gap
    
                    from    t
                    ) t
            ) t
    
    group by    range_id
    
    order by    StartDate
    ;
    

    +------------+------------+
    | startdate  | enddate    |
    +------------+------------+
    | 2017-01-01 | 2017-04-01 |
    | 2017-05-05 | 2017-08-31 |
    | 2017-10-01 | 2017-10-05 |
    | 2017-11-01 | 2017-12-25 |
    +------------+------------+