Question: If you implement proportionate stratified sampling using PySpark's sampleBy, isn't it just the same thing as a random sample?
Edit: There are both proportionate and disproportionate stratified sampling. This question is about the former.
Here's my thinking on this:
Let's say you have 4 groups in a population of total size N = 1000. The groups have proportions:
A: 25%, B: 50%, C: 13%, and D: 12%
Then choosing a proportionate stratified sample of size 100 means choosing a sample consisting of exactly 25 elements from A, 50 elements from B, 13 elements from C and 12 elements from D. (Note: A disproportionate stratified sample would be if you had different sampling ratios than those of the population.)
This is in contrast to a random sample, where only the expected numbers of elements from A, B, C and D are 25, 50, 13 and 12 respectively.
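To make the difference concrete, here is a minimal pure-Python sketch (not PySpark) of the population described above. The helper proportionate_stratified_sample and the seed are hypothetical names and values chosen for illustration; the sketch only assumes the stratum sizes implied by N = 1000.

```python
import random
from collections import Counter

# Population from the question: N = 1000, strata A/B/C/D at 25/50/13/12 %.
sizes = {"A": 250, "B": 500, "C": 130, "D": 120}
population = [(key, i) for key, n in sizes.items() for i in range(n)]

def proportionate_stratified_sample(rows, fraction, rng):
    """Draw exactly round(fraction * stratum_size) rows from every stratum."""
    by_key = {}
    for row in rows:
        by_key.setdefault(row[0], []).append(row)
    sample = []
    for members in by_key.values():
        sample.extend(rng.sample(members, round(fraction * len(members))))
    return sample

rng = random.Random(0)  # arbitrary seed, only for reproducibility
stratified = proportionate_stratified_sample(population, 0.1, rng)
simple = rng.sample(population, 100)  # simple random sample of the same total size

stratified_counts = Counter(key for key, _ in stratified)
simple_counts = Counter(key for key, _ in simple)
print("stratified:", dict(stratified_counts))  # always 25 / 50 / 13 / 12
print("simple    :", dict(simple_counts))      # varies from draw to draw
```

The stratified counts are fixed by construction, while the simple random sample only matches 25, 50, 13 and 12 on average.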
It would be natural to implement proportionate stratified sampling in PySpark via the sampleBy method with the fractions
fractions = {'A': .1, 'B': .1, 'C': .1, 'D': .1}
If this method sampled exactly, you'd get 25, 50, 13 and 12 elements respectively. However, this method is implemented with Bernoulli trials (coin flips), one per element. Since all the fractions are identical, each element is chosen with probability 10%, regardless of its stratum.
In this case, running the Bernoulli trials stratum by stratum and then element by element is the same as running them over the entire data set. The latter is just random sampling.
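The equivalence can be checked with a small pure-Python sketch that mimics a per-row coin flip. It is not Spark's actual code; coin_flip_sample and coin_flip_sample_by are hypothetical stand-ins for sample and sampleBy, and the seeds are arbitrary.

```python
import random

sizes = {"A": 250, "B": 500, "C": 130, "D": 120}
rows = [(key, i) for key, n in sizes.items() for i in range(n)]

def coin_flip_sample(rows, fraction, seed):
    """Keep each row independently with one global probability."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]

def coin_flip_sample_by(rows, fractions, seed):
    """Keep each row independently with the probability of its stratum."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fractions[row[0]]]

fractions = {"A": 0.1, "B": 0.1, "C": 0.1, "D": 0.1}
same = coin_flip_sample(rows, 0.1, seed=42) == coin_flip_sample_by(rows, fractions, seed=42)
print(same)  # True: with identical fractions both functions make the same decisions

# The total sample size is itself random; it is only 100 in expectation.
totals = [len(coin_flip_sample_by(rows, fractions, seed=s)) for s in range(5)]
print(totals)
```

With the same seed and identical fractions, both functions compare the same random draws against the same threshold, so they keep exactly the same rows.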
Conclusion: Stratified sampling is "not possible" in this paradigm. Is this a correct understanding?
I've seen some posts on doing exact sampling using special tricks. I'll see if I can answer my own post using one of these methods (see the third related post below).
Note: There is a sampleByKeyExact method, but it is not supported in Python, and even if it were, its performance and scaling penalties are not ideal.
https://spark.apache.org/docs/2.2.0/mllib-statistics.html
Related Posts:
1. Stratified sampling in Spark (mentions sampleByKeyExact, which isn't supported in Python)
2. Investopedia reference: https://www.investopedia.com/terms/stratified_random_sampling.asp
3. A creative workaround using additional columns that may work: pyspark - how to select exact number of records per strata using (df.sampleByKey()) in stratified random sampling
I think there is some confusion here related to standard definitions. Usually when someone says "stratified sampling", they mean that different classes should get different probabilities. In the example posted above
A: 25%, B: 50%, C: 13%, and D: 12%
A standard stratified sample would use fractions chosen so that, in expectation, the sample contains the same number of elements from each of A, B, C and D. For example,
fractions = {'A': .2, 'B': .1, 'C': 0.1*50/13, 'D': 0.1*50/12}
should give, in expectation, 50 elements of each class for the N = 1000 population above (5 of each if N were 100).
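As a quick check of that arithmetic: the expected count for a class is N × proportion × fraction. The sketch below assumes the question's population of N = 1000; with N = 100 the same fractions would give 5 per class.

```python
N = 1000  # population size taken from the question
proportions = {"A": 0.25, "B": 0.50, "C": 0.13, "D": 0.12}
fractions = {"A": 0.2, "B": 0.1, "C": 0.1 * 50 / 13, "D": 0.1 * 50 / 12}

# Expected number of sampled rows per class under independent coin flips.
expected = {key: N * proportions[key] * fractions[key] for key in proportions}
for key, value in expected.items():
    print(key, round(value, 6))
```

Every class comes out at 50 in expectation, even though the realized counts in any single run will still fluctuate around that value.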
In the example given above, where
fractions = {'A': .1, 'B': .1, 'C': 0.1, 'D': 0.1}
the behavior is indeed the same as a simple sample with a proportion of 0.1.
The real question is, what are you aiming for? If you want your sample to have exactly the same proportion of classes as the original, then neither sample nor sampleByKey will provide that. Looking at the documentation, it seems that sampleByKeyExact will indeed do the trick.
Edit, detailing the behavior of sample and sampleByKey:
For sample, a map operation basically goes over every element and, based on a random variable, decides whether to keep the item (and how many copies, in case withReplacement == True). This random variable is i.i.d. across all elements. In sampleByKey, the random variables are still independent, but their distribution depends on the key value, or more accurately on the corresponding value in the fractions argument. If the values in fractions are identical, the random variable has the same distribution for all key values; that is why sample and sampleByKey then behave identically.