Search code examples
sqlapache-sparkapache-spark-sqlwindow-functionsgaps-and-islands

Sub Grouping the results using Apache Spark SQL


I have the following events table, I would like to group them smaller time buckets as specified below.

The table have to be divided into smaller sets where start and end row of the set is determined by geohash if the geohash is same then set keep including the rows until it found next geohash as different.

key time_stamp  geohash
k1  1           abcdfg
k1  5           abcdfg
k1  7           abcdf1
k1  9           abcdfg
k1  10          abcdf2
k1  12          abcdf2
k1  21          abcdf2

How can I produce the following output using Apache Spark SQL syntax

key geohash first_time  last_time   duration    num_events
k1  abcdfg  1           5           4           2
k1  abcdf1  7           7           0           1
k1  abcdfg  9           9           0           1
k1  abcdf2  10          21          11          3

Can someone help me in achieving this.


Solution

  • This is a kind of gaps-and-island problem. Here is one way to solve it using row_number() and aggregation:

    select
        key, 
        geohash, 
        min(timestamp) first_time,
        max(timestamp) last_time,
        max(timestamp) - min(timestamp) duration,
        count(*) num_events
    from (
        select
            t.*,
            row_number() over(partition by key order by timestamp) rn1,
            row_number() over(partition by key, geohash order by timestamp) rn2
        from mytable t
    ) t
    group by 
        key,
        geohash,
        rn1 - rn2
    

    And, just for the fun of it: you could also do this with a conditional window sum:

    select
        key, 
        geohash, 
        min(timestamp) first_time,
        max(timestamp) last_time,
        max(timestamp) - min(timestamp) duration,
        count(*) num_events
    from (
        select
            t.*,
            sum(case when lag_geohash = geohash then 0 else 1 end) 
                over(partition by key order by timestamp) grp
        from (
            select
                t.*,
                lag(geohash) over(partition by key order by timestamp) lag_geohash
            from mytable t
        ) t 
    ) t
    group by 
        key,
        geohash,
        grp