I want to check whether the values of a PySpark DataFrame column are the same across all rows. For example, given the following DataFrame
+---+---+
|  A|  B|
+---+---+
|2.0|0.0|
|0.0|0.0|
|1.0|0.0|
|1.0|0.0|
|0.0|0.0|
|1.0|0.0|
|0.0|0.0|
+---+---+
column "A" is not constant, whereas "B" is.
I have tried two methods:
1- Check that the stddev is 0:
from pyspark.sql.functions import stddev, col
df.select(stddev(col('B'))).collect()
2- Get distinct values:
df.select("B").distinct().collect()
The first method takes 16 min to finish and the second one 12 min, but these are single runs, so I'm not sure how significant the difference is.
What is the best way to check this in PySpark?
stddev is quite an expensive operation, and so is distinct. If your task is to check whether all values in a specific column equal some specific value, I'd try something like:
df.filter(col('B') != your_value).count() == 0
It might be the case that you don't know the value in that column. That's easy to resolve: just retrieve the head (or any) value and compare against it:
your_value = df.select('B').first()[0]
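Putting both pieces together, here's a minimal self-contained sketch; the SparkSession setup, the toy data, and the is_constant helper name are all my own illustration, not part of the question:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Toy data mirroring the question's example.
df = spark.createDataFrame(
    [(2.0, 0.0), (0.0, 0.0), (1.0, 0.0), (1.0, 0.0),
     (0.0, 0.0), (1.0, 0.0), (0.0, 0.0)],
    ["A", "B"],
)

def is_constant(df, column):
    # Take one value from the column, then count rows that differ from it.
    # Note: rows where the column is NULL are dropped by the filter
    # (NULL != x evaluates to NULL), so handle nulls separately if needed.
    reference = df.select(column).first()[0]
    return df.filter(col(column) != reference).count() == 0

print(is_constant(df, "A"))  # False
print(is_constant(df, "B"))  # True
This only runs a single filter-and-count job per column, which should be cheaper than either aggregating a stddev or shuffling for distinct.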