Working in databricks, I've got a dataframe which looks like this:
columns = ["a", "b", "c"]
data = [(True, True, True), (True, True, True), (True, False, True)]
df = spark.createDataFrame(data).toDF(*columns)
df.display()
I'd like to create a new column "number_true_values" that contains the number of True values per row. Unfortunately, unlike in pandas, you can't simply sum up True and False values in PySpark. The code
import pyspark.sql.functions as F
df.withColumn('number_true_values', sum([F.col(column) for column in df.columns]))
throws the exception AnalysisException: [DATATYPE_MISMATCH.BINARY_OP_DIFF_TYPES] Cannot resolve "(a + 0)" due to data type mismatch: the left and right operands of the binary operator have incompatible types ("BOOLEAN" and "INT").;
If I had a dataframe that contains numbers instead like the following...
columns = ["a", "b", "c"]
data = [(1, 0, 1), (1, 0, 0), (1, 1, 1)]
df = spark.createDataFrame(data).toDF(*columns)
df.display()
... the syntax from above would work and return the desired result.
How do I count the number of True values per row in databricks?
You can cast your columns to long and then sum them, like this:
import pyspark.sql.functions as F
df.withColumn('number_true_values', sum([F.col(column).cast("long") for column in df.columns])).show()
Output:
+----+-----+----+------------------+
| a| b| c|number_true_values|
+----+-----+----+------------------+
|true| true|true| 3|
|true| true|true| 3|
|true|false|true| 2|
+----+-----+----+------------------+
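To see why the cast is needed, note that Python's built-in `sum` starts from the integer `0`, so `sum([F.col(c) for c in df.columns])` builds the Spark expression `0 + a + b + c`, and Spark's analyzer rejects adding `INT` to `BOOLEAN` (hence the `"(a + 0)"` in the error message). Plain Python booleans, and therefore pandas, don't hit this because `bool` is a subclass of `int`. A quick sketch of the difference, in plain Python so it runs without a Spark session:

```python
# Python's built-in sum starts from the integer 0, so summing Spark Column
# objects produces the expression 0 + a + b + c; Spark's analyzer rejects
# INT + BOOLEAN, which is why the error message mentions "(a + 0)".

# In plain Python (and hence pandas), booleans sum directly because
# bool is a subclass of int:
assert issubclass(bool, int)
assert sum([True, True, False]) == 2  # True counts as 1, False as 0

# Casting each column to a numeric type first, as in the answer above,
# gives the built-in sum integer operands all the way through.
```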