
Systematic sampling in PySpark


I’m quite new to PySpark and I’ve been struggling to find the answer I’m looking for.

I have a large sample of households and I want to conduct systematic sampling. As in true systematic sampling, I would like to begin at a random starting point and then select a household at regular intervals (e.g., every 50th household). I have looked into sample() and sampleBy(), but I don't think these are quite what I need. Can anyone give any advice on how I can do this? Many thanks in advance for your help!


Solution

  • monotonically_increasing_id only produces consecutive indices when the DataFrame has a single partition, so if you have more than one partition, consider row_number instead (a single-partition sketch with monotonically_increasing_id follows the row_number example below).

    Check "Notes" in https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.monotonically_increasing_id.html

    With row_number:

    import random
    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    random_start = random.randint(0, 49)  # random starting point within the first interval
    df = (df.withColumn("index", F.row_number().over(Window.orderBy('somecol')))
            .filter(((F.col('index') + random_start) % 50) == 0))
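
    For completeness, here is a minimal sketch of the single-partition variant with monotonically_increasing_id, assuming the data is small enough to coalesce into one partition (the column name somecol and the interval of 50 are carried over from the example above):

    import random
    from pyspark.sql import functions as F

    random_start = random.randint(0, 49)  # random starting point within the first interval
    sampled = (df.coalesce(1)  # IDs are only consecutive (0, 1, 2, ...) within a single partition
                 .withColumn("index", F.monotonically_increasing_id())
                 .filter(((F.col("index") + random_start) % 50) == 0))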