
Systematic sampling in PySpark


I’m quite new to PySpark and I’ve been struggling to find the answer I’m looking for.

I have a large sample of households and I want to conduct systematic sampling. As in true systematic sampling, I would like to begin at a random starting point and then select a household at regular intervals (e.g., every 50th household). I have looked into sample() and sampleBy(), but I don't think these are quite what I need. Can anyone give any advice on how I can do this? Many thanks in advance for your help!


Solution

  • monotonically_increasing_id only produces consecutive indices when the DataFrame has a single partition, so if you have more than one partition, consider row_number instead (a single-partition sketch with monotonically_increasing_id follows the row_number example below).

    Check "Notes" in https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.monotonically_increasing_id.html

    With row_number:

    import random
    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    random_start = random.randint(0, 49)  # random starting point within the first interval
    df = (df.withColumn("index", F.row_number().over(Window.orderBy('somecol')))
            .filter(((F.col('index') + random_start) % 50) == 0))
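
    For completeness, here is a minimal sketch of the single-partition variant with monotonically_increasing_id, assuming the data is small enough to coalesce into one partition (the column name somecol and the interval of 50 are carried over from the example above):

    import random
    from pyspark.sql import functions as F

    random_start = random.randint(0, 49)  # random starting point within the first interval
    sampled = (df.coalesce(1)  # IDs are only consecutive (0, 1, 2, ...) within a single partition
                 .withColumn("index", F.monotonically_increasing_id())
                 .filter(((F.col("index") + random_start) % 50) == 0))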