Search code examples
apache-sparkapache-spark-sql

How to find difference between two sequential array items in spark sql


I have a dataset with array column like that:

+--------+--------------------+
|user_id |array_of_items      |
+--------+--------------------+
|4049494 |[x1, x2, x3, x4, x5]|
|3046908 |[x6, x7, x8, x9]    |
...

How can I find arrays like below without udf, using only SQL?

+--------+----------------------------+
|user_id |array_of_items_diff         |
+--------+----------------------------+
|4049494 |[x2-x1, x3-x2, x4-x3, x5-x4]|
|3046908 |[x7-x6, x8-x7, x9-x8]       |
...

Solution

  • You can use the transform function:

    df.selectExpr("""
        slice(
            transform(array_of_items, (x,i) -> array_of_items[i+1] || '-' || x)
          , 1 
          , size(array_of_items)-1
        ) as array_of_item_diff
    """).show(2,0)
    +----------------------------+
    |array_of_item_diff          |
    +----------------------------+
    |[x2-x1, x3-x2, x4-x3, x5-x4]|
    |[x7-x6, x8-x7, x9-x8]       |
    +----------------------------+
    

    If you need actual math instead of concatenation, then just change array_of_items[i+1] || '-' || x to array_of_items[i+1] - x.

    I used slice(...,1,size(array_of_items)-1) to remove the last array item which is null. in case there is no item with null value in the original array, you can also use array_compact(spark 3.4.0+) or filter function to remove the resulting null value.