python-3.x pyspark sample

How to create a stratified sample for a boolean field with True, False and Null values?


I have a DataFrame with a boolean field:

df = spark.createDataFrame([
  [True],   
  [False],   
  [None],
  [True],   
  [False],
  [None]
]).toDF("match")

I want to create a stratified sample in PySpark in which True, False and Null values are equally represented.

How can I also get the Null values into my sample? Adding None: 0.3 to fractions is not accepted:

sampled = df.sampleBy("match", fractions={True: 0.3, False: 0.3})

Solution

  • According to the source code of the sampleBy method, the fractions parameter is a Map[T, Double]. Spark does not allow null keys in a MapType column (see the docs), so None cannot be used as a stratum key:

    def sampleBy[T](col: String, fractions: Map[T, Double], seed: Long): DataFrame = {
      sampleBy(Column(col), fractions, seed)
    }
    

    One possible solution is to add a flag column that maps False, True and NULL to 0, 1 and 2, and then call sampleBy on this flag, for example:

    from pyspark.sql.functions import expr
    
    # int(match) casts False/True to 0/1; coalesce maps NULL to 2
    df_sample = df.withColumn('flag', expr("coalesce(int(match), 2)")) \
        .sampleBy("flag", {0:0.3, 1:0.3, 2:0.3}) \
        .drop("flag")
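
Equal fractions give each stratum the same sampling rate, not the same number of rows. If roughly equal row counts per stratum are wanted, the fractions can be derived from the stratum sizes. The following is a minimal sketch under that assumption. The helper names `balanced_fractions` and `stratified_sample_with_nulls`, the `target` parameter, the temporary `flag` column name and the default seed are hypothetical and not part of the original answer. Since sampleBy samples each row by fraction rather than drawing a fixed number of rows, the resulting counts are only approximately equal.

```python
def balanced_fractions(counts, target):
    """Map each stratum to the fraction that yields about `target` rows.

    counts: dict of stratum key -> number of rows in that stratum.
    Fractions are capped at 1.0 because a stratum cannot be oversampled.
    """
    return {key: min(1.0, target / n) for key, n in counts.items() if n > 0}


def stratified_sample_with_nulls(df, col, target, seed=42):
    """Sample about `target` rows each for False, True and NULL in `col`."""
    # Imported here so that balanced_fractions can be used without Spark.
    from pyspark.sql import functions as F

    # Same idea as the flag above: False -> 0, True -> 1, NULL -> 2.
    flagged = df.withColumn(
        "flag", F.coalesce(F.col(col).cast("int"), F.lit(2))
    )
    # Count the rows per stratum to derive the per-stratum fractions.
    counts = {
        row["flag"]: row["count"]
        for row in flagged.groupBy("flag").count().collect()
    }
    fractions = balanced_fractions(counts, target)
    return flagged.sampleBy("flag", fractions, seed).drop("flag")
```

With the DataFrame from the question, a call such as `stratified_sample_with_nulls(df, "match", target=1)` would use a fraction of 0.5 for each of the three strata, since each contains two rows.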