Search code examples
sqlapache-sparkpysparkapache-spark-sqlwindow-functions

How to define WINDOWING function in Spark SQL query to avoid repetitive code


I have a query which has many lead and lag, due to which the partition by code is repeated.

If I use Scala code I can define the window spec and reuse it , so is there a way I can reuse the partition code in Spark SQL.

Objective is to avoid the repetition of "over ( partition by sessionId, deviceId order by entry_datetime ) "

SELECT * ,
lag( channel,1,null ) over ( partition by sessionId, deviceId order by entry_datetime ) as prev_chnl,
lead( channel,1,null ) over ( partition by sessionId, deviceId order by entry_datetime ) as next_chnl,
lag( channel-source,1,null ) over ( partition by sessionId, deviceId order by entry_datetime ) as prev_chnl_source,
lead( channel-source,1,null ) over ( partition by sessionId, deviceId order by entry_datetime ) as next_chnl_source,
FROM RAW_VIEW

RAW_VIEW

+------------+-----------+---------------------+---------+-----------------+
|sessionId   |deviceId   |entry_datetime       |channel  |channel-source   |
+------------+-----------+---------------------+---------+-----------------+
|SESSION-ID-1|DEVICE-ID-1|2018-04-09 15:00:00.0|001      |Internet         |
|SESSION-ID-1|DEVICE-ID-1|2018-04-09 16:00:00.0|002      |Cable            |
|SESSION-ID-1|DEVICE-ID-1|2018-04-09 17:00:00.0|003      |Satellite        |
+------------+-----------+---------------------+---------+-----------------+

FINAL VIEW

+------------+-----------+---------------------+---------+-----------------+---------+---------+-----------------+-----------------+
|sessionId   |deviceId   |entry_datetime       |channel  |channel-source   |prev_chnl|next_chnl|prev_chnl_source |next_chnl_source |
+------------+-----------+---------------------+---------+-----------------+---------+---------+-----------------+-----------------+
|SESSION-ID-1|DEVICE-ID-1|2018-04-09 15:00:00.0|001      |Internet         |null     |002      |null             |Cable            |
|SESSION-ID-1|DEVICE-ID-1|2018-04-09 15:01:00.0|002      |Cable            |001      |003      |Internet         |Satellite        |
|SESSION-ID-1|DEVICE-ID-1|2018-04-09 15:02:00.0|003      |Satellite        |002      |null     |Cable            |null             |
+------------+-----------+---------------------+---------+-----------------+---------+---------+-----------------+-----------------+

Solution

  • You should be able to define named window and reference it in the query:

    SELECT * ,
      lag(channel, 1) OVER w AS prev_chnl,
      lead(channel, 1) OVER w AS next_chnl,
      lag(channel-source, 1) OVER w AS prev_chnl_source,
      lead(channel-source, 1) OVER w AS next_chnl_source,
    FROM raw_view
    WINDOW w AS (PARTITION BY sessionId, deviceId ORDER BY entry_datetime)
    

    but it looks like this functionality is currently broken.