Working in databricks, I've got a dataframe which looks like this:
columns = ["a", "b", "c"]
data = [(True, True, True), (True, True, True), (True, False, True)]
df = spark.createDataFrame(data).toDF(*columns)
df.display()
I'd like to create a new column "number_true_values" that contains the number of True values per row. Unfortunately, unlike in pandas, you can't simply sum up True and False values in PySpark. The code
import pyspark.sql.functions as F
df.withColumn('number_true_values', sum([F.col(column) for column in df.columns]))
throws the exception AnalysisException: [DATATYPE_MISMATCH.BINARY_OP_DIFF_TYPES] Cannot resolve "(a + 0)" due to data type mismatch: the left and right operands of the binary operator have incompatible types ("BOOLEAN" and "INT").;
If I had a dataframe that contains numbers instead like the following...
columns = ["a", "b", "c"]
data = [(1, 0, 1), (1, 0, 0), (1, 1, 1)]
df = spark.createDataFrame(data).toDF(*columns)
df.display()
... the syntax from above would work and return the desired result.
How do I count the number of True values per row in databricks?
You can cast your columns to long and then sum them, like this:
import pyspark.sql.functions as F
df.withColumn('number_true_values', sum([F.col(column).cast("long") for column in df.columns])).show()
Output:
+----+-----+----+------------------+
| a| b| c|number_true_values|
+----+-----+----+------------------+
|true| true|true| 3|
|true| true|true| 3|
|true|false|true| 2|
+----+-----+----+------------------+
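To see why the cast is needed, note that Python's built-in `sum` starts from the integer `0`, so `sum([F.col(c) for c in df.columns])` builds the Spark expression `0 + a + b + c`, and Spark's analyzer rejects adding `INT` to `BOOLEAN` (hence the `"(a + 0)"` in the error message). Plain Python booleans, and therefore pandas, don't hit this because `bool` is a subclass of `int`. A quick sketch of the difference, in plain Python so it runs without a Spark session:

```python
# Python's built-in sum starts from the integer 0, so summing Spark Column
# objects produces the expression 0 + a + b + c; Spark's analyzer rejects
# INT + BOOLEAN, which is why the error message mentions "(a + 0)".

# In plain Python (and hence pandas), booleans sum directly because
# bool is a subclass of int:
assert issubclass(bool, int)
assert sum([True, True, False]) == 2  # True counts as 1, False as 0

# Casting each column to a numeric type first, as in the answer above,
# gives the built-in sum integer operands all the way through.
```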