Fastest way to know if a column has a constant value in a PySpark dataframe


I want to check whether the values of a PySpark DataFrame column are the same across all rows. For example, given the following DataFrame:

+----------+----------+
|    A     |    B     |
+----------+----------+
|       2.0|       0.0|
|       0.0|       0.0|
|       1.0|       0.0|
|       1.0|       0.0|
|       0.0|       0.0|
|       1.0|       0.0|
|       0.0|       0.0|
+----------+----------+

the column "A" is not constant and "B" is.

I have tried two methods:

1- Check the stddev = 0:

from pyspark.sql.functions import col, stddev
df.select(stddev(col('B'))).collect()

2- Get distinct values:

df.select("B").distinct().collect()

The first method took 16 min and the second 12 min, but I ran each only once, so I'm not sure the difference is significant.

What is the best way to check it in PySpark?


Solution

  • stddev is a fairly expensive operation, and so is distinct. If the task is to check whether every value in a column equals one specific value, I'd try something like:

    df.filter(col('B') != your_value).count() == 0

    You might not know the column's value in advance. That is easy to resolve: retrieve the first value and compare against it:

    your_value = df.select('B').first()[0]