Tags: python, apache-spark, pyspark, databricks, azure-databricks

Check whether boolean column contains only True values


Working in Databricks, I've got a dataframe which looks like this:

columns = ["a", "b", "c"]
data = [(True, True, True), (True, True, True), (True, False, True)]
df = spark.createDataFrame(data).toDF(*columns)
df.display()

+----+-----+----+
|   a|    b|   c|
+----+-----+----+
|true| true|true|
|true| true|true|
|true|false|true|
+----+-----+----+

I'd like to select only those columns of the dataframe in which not all values are True.
In pandas, I would use df['a'].all() to check whether all values of column "a" are True. Unfortunately, I can't find an equivalent in PySpark. I have found a solution to the problem, but it seems much too complicated:

df.select(*[column for column in df.columns 
            if df.select(column).distinct().collect() != 
            spark.createDataFrame([True], 'boolean').toDF(column).collect()])

The solution returns what I want:

+-----+
|    b|
+-----+
| true|
| true|
|false|
+-----+

Is there an easier way of doing this in PySpark?


Solution

  • Here is one approach that follows an operational model similar to that of pandas:

    from pyspark.sql import functions as F

    def is_all(c):
        # A column is "all True" when the number of True values equals the count of non-null values
        return (F.sum(F.col(c).astype('int')) == F.count(c)).alias(c)

    # Create a boolean mask; in pandas this would be similar to df.all(axis=0)
    mask = df.agg(*[is_all(c) for c in columns]).collect()[0]

    # Use the boolean mask to filter the columns
    result = df.select(*[c for c in columns if not mask[c]])
    

    Result

    +-----+
    |    b|
    +-----+
    | true|
    | true|
    |false|
    +-----+
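
  • A shorter variant (a sketch, not part of the original answer) relies on Spark ordering booleans with false < true, so F.min(c) evaluates to True only when every non-null value in the column is True:

    from pyspark.sql import functions as F

    # min() over a boolean column is True only if every non-null value is True,
    # because Spark orders booleans as false < true
    mask = df.agg(*[F.min(c).alias(c) for c in df.columns]).collect()[0]

    # Keep only the columns that are not entirely True
    result = df.select(*[c for c in df.columns if not mask[c]])

    This produces the same single-column result (b) as the approach above.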