Tags: python, pyspark, databricks, azure-databricks

Pyspark: Count number of True values per row


Working in databricks, I've got a dataframe which looks like this:

columns = ["a", "b", "c"]
data = [(True, True, True), (True, True, True), (True, False, True)]
df = spark.createDataFrame(data).toDF(*columns)
df.display()

+----+-----+----+
|   a|    b|   c|
+----+-----+----+
|true| true|true|
|true| true|true|
|true|false|true|
+----+-----+----+

I'd like to create a new column "number_true_values" that contains the number of True values per row. Unfortunately, one does not seem to be able to just sum up True and False values in pyspark like in pandas. The code

import pyspark.sql.functions as F
df.withColumn('number_true_values', sum([F.col(column) for column in df.columns]))

throws the exception AnalysisException: [DATATYPE_MISMATCH.BINARY_OP_DIFF_TYPES] Cannot resolve "(a + 0)" due to data type mismatch: the left and right operands of the binary operator have incompatible types ("BOOLEAN" and "INT").; (Python's built-in sum starts from the integer 0, which is why Spark ends up trying to add a boolean column to an int.)
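For comparison, this is what I mean by "like in pandas": with the same data, booleans are summed directly as 0/1 (a minimal sketch, assuming pandas is available in the notebook):

import pandas as pd

columns = ["a", "b", "c"]
data = [(True, True, True), (True, True, True), (True, False, True)]

pdf = pd.DataFrame(data, columns=columns)
pdf["number_true_values"] = pdf.sum(axis=1)  # booleans count as 0/1 here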

If I had a dataframe that contains numbers instead like the following...

columns = ["a", "b", "c"]
data = [(1, 0, 1), (1, 0, 0), (1, 1, 1)]
df = spark.createDataFrame(data).toDF(*columns)
df.display()

+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  0|  1|
|  1|  0|  0|
|  1|  1|  1|
+---+---+---+

... the syntax from above would work and return the desired result:

+---+---+---+------------------+
|  a|  b|  c|number_true_values|
+---+---+---+------------------+
|  1|  0|  1|                 2|
|  1|  0|  0|                 1|
|  1|  1|  1|                 3|
+---+---+---+------------------+

How do I count the number of True values per row in databricks?


Solution

  • You can cast your columns to long and then sum them, something like this:

    import pyspark.sql.functions as F
    
    # Cast each boolean column to 0/1 and sum them row-wise
    df.withColumn('number_true_values', sum([F.col(column).cast("long") for column in df.columns])).show()
    

    Output:

    +----+-----+----+------------------+
    |   a|    b|   c|number_true_values|
    +----+-----+----+------------------+
    |true| true|true|                 3|
    |true| true|true|                 3|
    |true|false|true|                 2|
    +----+-----+----+------------------+
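  • Alternatively, if you'd rather keep the columns boolean, here's a sketch that collects them into an array and counts the entries that are true (assuming Spark 3.1+, where pyspark.sql.functions.filter accepts a Python lambda):

    import pyspark.sql.functions as F

    df.withColumn(
        'number_true_values',
        # build an array of the boolean columns, keep only the true entries, and count them
        F.size(F.filter(F.array(*df.columns), lambda x: x))
    ).show()

    This avoids the cast entirely and should give the same counts as above.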