apache-spark, pyspark

Reduce array column by element-wise sum in Spark


I have a DataFrame in PySpark with a column "c1" where each row consists of an array of integers:

c1
[1, 2, 3]
[4, 5, 6]
[7, 8, 9]

I wish to perform an element-wise sum (i.e. just regular vector addition) over this column to reduce it to the single array [12, 15, 18].

What I would want is something akin to df.select(sum("c1")) but which performs vector/array addition over the chosen column.

When searching I mostly find zip_with and other higher-order-function solutions that assume I have multiple array columns and want to add them together within each row, rather than across rows. Perhaps I haven't been able to word it right.

EDIT: the c1 column is just an example for ease of writing. In reality the column has a couple thousand rows, and each array has a couple hundred elements.
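
For anyone reproducing this, here is a minimal sketch of the example data (the names spark and df are just placeholders used throughout):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# three rows, each holding an array of integers in column "c1"
df = spark.createDataFrame([([1, 2, 3],), ([4, 5, 6],), ([7, 8, 9],)], ["c1"])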


Solution

With help from notNull I managed to come up with this solution:

from pyspark.sql.functions import array, col, max, size, sum  # note: max/sum shadow the Python built-ins

# out-of-bounds array indices return null, which sum() ignores
max_elements = df.select(max(size("c1"))).first()[0]

# Create a list of column expressions to be used in the select statement
selected_cols = [sum(col("c1")[i]).alias("elem_" + str(i)) for i in range(max_elements)]

df_new = df.select(array(*selected_cols).alias("sum"))
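
On the example data this collapses the three rows into a single summed array (the element type comes back as bigint, since sum() returns longs):

df_new.show()
#+------------+
#|         sum|
#+------------+
#|[12, 15, 18]|
#+------------+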

Solution

  • You can split the array into three columns, then use groupBy to aggregate the per-element sums back into an array.

    Example:

    from pyspark.sql.functions import array, col, lit, sum

    df.withColumn("temp1", col("c1")[0]).\
        withColumn("temp2", col("c1")[1]).\
        withColumn("temp3", col("c1")[2]).\
        groupBy(lit("1")).\
        agg(array(sum(col("temp1")), sum(col("temp2")), sum(col("temp3"))).cast("array<int>").alias("sum")).\
        drop("1").\
        show()
    #+------------+
    #|         sum|
    #+------------+
    #|[12, 15, 18]|
    #+------------+
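
    Since the real arrays have a couple hundred elements (see the EDIT above), hard-coding temp1/temp2/temp3 doesn't scale; the same idea can be written with a generated list of sums. This is just a sketch reusing the imports and df from above, with n standing in for the array length:

    n = 3  # length of each array; a couple hundred in the real data
    df.groupBy(lit("1")).\
        agg(array(*[sum(col("c1")[i]) for i in range(n)]).cast("array<int>").alias("sum")).\
        drop("1").\
        show()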