I have a DataFrame with a boolean column.
df = spark.createDataFrame([
[True],
[False],
[None],
[True],
[False],
[None]
]).toDF("match")
I want to create a stratified sample (PySpark) with equal fractions of True, False and Null values.
How can I also get the Null values in my sample? (None: 0.3 is not accepted as a key.)
sampled = df.sampleBy("match", fractions={True: 0.3, False: 0.3})
Based on the source code of the sampleBy method, the parameter fractions is a Map[T, Double], and Spark does not allow null keys in a MapType column (see doc):
def sampleBy[T](col: String, fractions: Map[T, Double], seed: Long): DataFrame = {
  sampleBy(Column(col), fractions, seed)
}
One possible solution is to add a flag column that converts False, True and NULL to 0, 1 and 2, and then call sampleBy on this flag, for example:
from pyspark.sql.functions import expr

# Map False/True/NULL to 0/1/2 so that every stratum has a non-null key
df_sample = df.withColumn('flag', expr("coalesce(int(match), 2)")) \
    .sampleBy("flag", {0: 0.3, 1: 0.3, 2: 0.3}) \
    .drop("flag")
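To see why the flag trick works, here is a plain-Python sketch of the same idea. It is an illustration only, not Spark's implementation: the names to_flag, sample_by and NULL_FLAG are hypothetical, and the sketch assumes the documented behaviour of sampleBy, where each row is kept independently with the probability given for its stratum and strata missing from fractions are treated as 0. Because None becomes an ordinary integer key, it can get its own fraction like any other stratum.

```python
import random
from collections import Counter

NULL_FLAG = 2  # hypothetical sentinel stratum for None, mirroring coalesce(int(match), 2)


def to_flag(value):
    """Map False/True/None to 0/1/2."""
    return NULL_FLAG if value is None else int(value)


def sample_by(rows, fractions, seed=None):
    """Keep each row independently with probability fractions[flag].

    Strata that are missing from `fractions` get probability 0.
    """
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fractions.get(to_flag(row), 0.0)]


rows = [True, False, None] * 1000
sampled = sample_by(rows, {0: 0.3, 1: 0.3, 2: 0.3}, seed=42)

# Count the sampled rows per stratum (keys are the flags 0, 1 and 2)
print(Counter(to_flag(row) for row in sampled))
```

Note that equal fractions keep roughly the same share of each stratum, not the same number of rows. If the strata differ in size and you need similar counts, the fractions would have to be chosen per stratum, for example as the target count divided by the stratum size.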