I’m quite new to PySpark and I’ve been struggling to find the answer I’m looking for.
I have a large sample of households and I want to conduct systematic sampling. As in true systematic sampling, I would like to begin at a random starting point and then select a household at regular intervals (e.g. every 50th household). I have looked into sample() and sampleBy(), but I don't think these are quite what I need. Can anyone give any advice on how I can do this? Many thanks in advance for your help!
monotonically_increasing_id only produces consecutive IDs when the DataFrame has a single partition, so if you have more than one partition, consider row_number instead. See the "Notes" section in https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.monotonically_increasing_id.html
With row_number:
import random
from pyspark.sql import functions as F
from pyspark.sql.window import Window

random_start = random.randint(0, 49)  # random starting point for the systematic sample
df = (df.withColumn("index", F.row_number().over(Window.orderBy("somecol")))
        .filter(((F.col("index") + random_start) % 50) == 0))
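One caveat: Window.orderBy without a partitionBy clause moves all rows into a single partition to assign the row numbers, and Spark will warn about the performance impact. For very large data you may want to assign the index within partitions (or by some key) and adapt the filter accordingly.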