I have a DataFrame in PySpark with a column "c1" where each row consists of an array of integers:
c1        |
----------|
[1, 2, 3] |
[4, 5, 6] |
[7, 8, 9] |
I wish to perform an element-wise sum (i.e., just regular vector addition) over this column to reduce it to the single array [12, 15, 18].
What I would want is something akin to df.select(sum("c1")), but which performs vector/array addition over the chosen column.
When searching I mostly find various zip_with and higher-order function solutions that assume I have multiple columns and wish to add rows together. Perhaps I haven't been able to word it right.
EDIT: the c1 column is just an example for ease of writing. I really have a column with a couple thousand entries, where each array has a couple hundred elements.
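For reference, a toy DataFrame matching the example above can be built with something like the following (my own sketch; it assumes an existing SparkSession bound to the name spark):

df = spark.createDataFrame(
    [([1, 2, 3],), ([4, 5, 6],), ([7, 8, 9],)],
    ["c1"],
)
# c1 is inferred as array<bigint>, since Python ints become longs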
Solution
With help from notNull I managed to come up with this solution:
from pyspark.sql.functions import array, col, size
from pyspark.sql.functions import max as max_, sum as sum_   # aliased to avoid shadowing the builtins

# Indexing past the end of a shorter array yields null, which sum ignores
max_elements = df.select(max_(size("c1"))).first()[0]

# Create a list of column expressions to be used in the select statement
selected_cols = [sum_(col("c1")[i]).alias("elem_" + str(i)) for i in range(max_elements)]

df_new = df.select(array(*selected_cols).alias("sum"))
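The zip_with solutions mentioned in the question can also be adapted to a single column by folding all the row arrays into one. The sketch below is my own, not from the original posts: it assumes Spark 3.1+ (for Python lambdas in aggregate and zip_with) and that every array in c1 has the same length, since zip_with pads the shorter side with nulls.

from pyspark.sql import functions as F

n = df.select(F.max(F.size("c1"))).first()[0]

summed = df.agg(
    F.aggregate(
        F.collect_list("c1"),                          # one array holding all the row arrays
        F.array(*[F.lit(0)] * n).cast("array<long>"),  # accumulator: n zeros, matching c1's element type
        lambda acc, arr: F.zip_with(acc, arr, lambda a, b: a + b),  # element-wise add
    ).alias("sum")
)
summed.show(truncate=False)   # [12, 15, 18] for the toy data

Note that collect_list pulls every array into a single aggregation buffer, which is fine for a few thousand rows of a few hundred elements but not for arbitrarily large data.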
You can split the array into three columns and then use groupBy to build the summed array.
Example:
df.withColumn("temp1", col("temp")[0]).\
withColumn("temp2", col("temp")[1]).\
withColumn("temp3", col("temp")[2]).\
groupBy(lit("1")).agg(array(sum(col("temp1")),sum(col("temp2")),sum(col("temp3"))).cast("array<int>").alias("sum")).\
drop("1").\
show()
#+------------+
#| sum|
#+------------+
#|[12, 15, 18]|
#+------------+
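The answer above hardcodes the three positions, which does not scale to the couple-hundred-element arrays mentioned in the EDIT. A variation of the same groupBy idea that avoids enumerating indices (again my own sketch, assuming Spark 3.1+ for transform with a Python lambda) is to explode each array together with element positions, sum per position, and re-assemble in position order:

from pyspark.sql import functions as F

# One row per (position, value); "pos" and "col" are posexplode's default names
per_position = df.select(F.posexplode("c1")).groupBy("pos").agg(F.sum("col").alias("s"))

# collect_list gives no ordering guarantee, so sort the (pos, s) structs by
# position first, then keep only the summed values
result = per_position.agg(
    F.transform(
        F.array_sort(F.collect_list(F.struct("pos", "s"))),
        lambda x: x["s"],
    ).alias("sum")
)
result.show(truncate=False)   # [12, 15, 18] for the toy data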