Working in Databricks, I've got a dataframe which looks like this:
columns = ["a", "b", "c"]
data = [(True, True, True), (True, True, True), (True, False, True)]
df = spark.createDataFrame(data).toDF(*columns)
df.display()
I'd like to select only those columns of the dataframe in which not all values are True.
In pandas, I would use df['a'].all()
to check whether all values of column "a" are True. Unfortunately, I can't find an equivalent in PySpark.
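For context, this is roughly the pandas idiom I have in mind (a small sketch on the same data; df.all(axis=0) gives one boolean per column, and negating it selects the columns that are not all True):

import pandas as pd

pdf = pd.DataFrame({
    "a": [True, True, True],
    "b": [True, True, False],
    "c": [True, True, True],
})

# keep only the columns whose values are not all True
print(pdf.loc[:, ~pdf.all(axis=0)])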
I have found a solution for the problem, but it seems much too complicated:
df.select(*[column for column in df.columns
            if df.select(column).distinct().collect() !=
            spark.createDataFrame([True], 'boolean').toDF(column).collect()])
The solution returns what I want.
Is there an easier way of doing this in PySpark?
Here is one approach that follows a similar operational model to that of pandas:
from pyspark.sql import functions as F

# a column is "all True" when the sum of its values (cast to int)
# equals its non-null count
def is_all(c):
    return (F.sum(F.col(c).astype('int')) == F.count(c)).alias(c)
# create a boolean mask;
# e.g. in pandas this would be similar to df.all(axis=0)
mask = df.agg(*[is_all(c) for c in columns]).collect()[0]
# Use the boolean mask to filter the columns
result = df.select(*[c for c in columns if not mask[c]])
Result
+-----+
| b|
+-----+
| true|
| true|
|false|
+-----+
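If you are on Spark 3.0 or later, a slightly shorter variant of the same idea (a sketch, assuming the bool_and SQL aggregate is available in your runtime) avoids the int cast by aggregating with bool_and through F.expr:

from pyspark.sql import functions as F

# one Row with True/False per column: True where every (non-null) value is True
mask = df.agg(*[F.expr(f"bool_and({c})").alias(c) for c in df.columns]).collect()[0]

result = df.select(*[c for c in df.columns if not mask[c]])
result.display()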