
Random sample from column of ArrayType Pyspark


I have a column in a PySpark DataFrame with a structure like:

Column1
[a,b,c,d,e]
[c,b,d,f,g,h,i,p,l,m]

I'd like to return another column containing a random selection from the array in each row, with the sample size specified as a function argument.

So something like data.withColumn("sample", SOME_FUNCTION("column1", 5)) returning:

sample
[a,b,c,d,e]
[c,b,h,i,p]

Hopefully avoiding a Python UDF; it feels like there should be a built-in function for this?

This works:

import random
from pyspark.sql import functions as F, types as T

def random_sample(population):
    # Note: random.sample raises ValueError if the array has fewer than 5 elements
    return random.sample(population, 5)

udf_random = F.udf(random_sample, T.ArrayType(T.StringType()))
df.withColumn("sample", udf_random("column1")).show()

But as I said, it would be good to avoid a UDF.


Solution

  • For Spark 2.4+, use shuffle and slice:

    df = spark.createDataFrame([(list('abcde'),),(list('cbdfghiplm'),)],['column1'])
    
    df.selectExpr('slice(shuffle(column1),1,5)').show()
    +-----------------------------+
    |slice(shuffle(column1), 1, 5)|
    +-----------------------------+
    |              [b, a, e, d, c]|
    |              [h, f, d, l, m]|
    +-----------------------------+