
Create a column of arrays holding the differences between adjacent numbers in another column's array (Python/PySpark)


I have a column of arrays of numbers, e.g. [0,80,160,220], and would like to create a column of arrays holding the differences between adjacent terms, e.g. [80,80,60].

Does anyone have an idea how to do this in Python/PySpark? My code is:

    df=df.withcolumn('col_array_diffs', [df.col_array.getItem[i]-df.col_array.getItem[i-1] if i else None for i in range(1,F.size(df.col_array))])

but I am really struggling with the ArrayType. This produces AssertionError: col should be Column... Thanks!


Solution

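  • You can use a UDF to do this. withColumn expects a single Column expression, so passing it a Python list built with a comprehension fails with the assertion above; a UDF runs the list logic on each row's array instead.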

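    import pandas as pd
    import pyspark.sql.functions as F
    import pyspark.sql.types as T
    
    def subtract_el(x):
        # Pair each element with its successor: (x[0], x[1]), (x[1], x[2]), ...
        # abs() yields unsigned gaps; use `j - i` if signed differences are needed.
        return [abs(i-j) for i, j in zip(x, x[1:])]
    
    # Assumes an active SparkSession named `spark`.
    # The pandas column label 0 becomes the Spark column name "0".
    df = spark.createDataFrame(pd.DataFrame([[[0,80,160,220]]]))
    df.select(F.udf(subtract_el, T.ArrayType(T.IntegerType()))("0").alias("diff")).show()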
    

    This results in:

    +------------+
    |        diff|
    +------------+
    |[80, 80, 60]|
    +------------+
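
Before wrapping the helper in a UDF, it can be worth checking the pairing logic in plain Python. The following is only a sketch: the name `adjacent_diffs` and the `signed` flag are hypothetical and not part of the answer above. It assumes that null arrays should map to a null result, since a Python UDF receives None for null cells, and it shows how signed differences differ from the absolute ones the answer computes.

```python
from typing import List, Optional


def adjacent_diffs(xs: Optional[List[int]], signed: bool = True) -> Optional[List[int]]:
    """Return the differences between each element and its predecessor.

    Hypothetical helper for illustration; not from the original answer.
    """
    if xs is None:
        # Assumption: a null input array should produce a null output.
        return None
    # zip(xs, xs[1:]) pairs each element with its successor.
    diffs = [b - a for a, b in zip(xs, xs[1:])]
    return diffs if signed else [abs(d) for d in diffs]


print(adjacent_diffs([0, 80, 160, 220]))         # [80, 80, 60]
print(adjacent_diffs([5, 3, 10]))                # [-2, 7]
print(adjacent_diffs([5, 3, 10], signed=False))  # [2, 7]
```

A helper like this could presumably be registered the same way as `subtract_el`, by passing it to `F.udf` with `T.ArrayType(T.IntegerType())` as the return type. An array with fewer than two elements yields an empty list.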