apache-spark pyspark sparkr

Summing multiple columns in Spark


How can I sum multiple columns in Spark? For example, in SparkR the following code works to get the sum of one column, but if I try to get the sum of both columns in df, I get an error.

# Create SparkDataFrame
df <- createDataFrame(faithful)

# Use agg to sum total waiting times
head(agg(df, totalWaiting = sum(df$waiting)))
##This works

# Use agg to sum total of waiting and eruptions
head(agg(df, total = sum(df$waiting, df$eruptions)))
##This doesn't work

An answer in either SparkR or PySpark is fine.


Solution

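  • For PySpark, if you don't want to type out the columns explicitly, fold the column expressions together with reduce. This adds a row-wise total column:

    from functools import reduce
    from operator import add
    from pyspark.sql import functions as F

    # numeric_col_list is a list of the numeric column names to add, e.g. ['waiting', 'eruptions']
    new_df = df.withColumn('total', reduce(add, [F.col(x) for x in numeric_col_list]))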
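
In case it helps to see why the reduce trick works: PySpark columns overload `+`, so folding a list of them with `add` builds one big expression rather than a number. Here is a rough pure-Python sketch of that idea. Note that `Expr` and `col` are made-up stand-ins I wrote for illustration (they are not PySpark classes), and `extra` is a hypothetical third column name.

```python
from functools import reduce
from operator import add


class Expr:
    """Hypothetical stand-in for a Spark Column: it records an expression as text."""

    def __init__(self, text):
        self.text = text

    def __add__(self, other):
        # As with a Column, `+` returns a new expression instead of computing a value.
        return Expr(f"({self.text} + {other.text})")


def col(name):
    # Hypothetical analogue of F.col: wrap a column name in an expression object.
    return Expr(name)


# "extra" is an invented column name, used only to show more than two columns.
numeric_col_list = ["waiting", "eruptions", "extra"]

# reduce(add, [a, b, c]) evaluates add(add(a, b), c), i.e. (a + b) + c.
total = reduce(add, [col(x) for x in numeric_col_list])
print(total.text)  # ((waiting + eruptions) + extra)
```

With real PySpark columns, the folded expression is what gets passed to withColumn. If you want a single grand total, as in the question, rather than a per-row column, you can aggregate afterwards, e.g. `new_df.agg(F.sum('total'))`.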