Tags: python, apache-spark, pyspark, apache-spark-sql, shuffle

Shuffle an array of arrays in PySpark columns


I have a PySpark column like this:

                   gm_array
[[1, 4, 6,...], [2, 7, 8,...], [3, 5, 7,...],...]
[[8, 11, 9,...], [7, 2, 6,...], [10, 9, 8,...],...]
[[90, 13, 67,...], [55, 6, 98,...], [1, 6, 2,...],...]
.
.

Now I want to shuffle the outer array as well as each array inside it, and then pick the first element from each of the first 5 inner arrays.

First output, the randomly shuffled arrays:

                  gm_array
[[19, 6, 1,...], [9, 80, 5,...], [30, 7, 3,...],...]
[[7, 9, 11,...], [6, 8, 7,...], [18, 7, 10,...],...]
[[90, 1, 7,...], [8, 9, 81,...], [6, 5, 1,...],...]
.
.

Second output, the first element of each of the first 5 arrays inside the main array:

[19, 9, 30,...]
[7, 6, 18,...]
[90, 8, 6,...]
.
.
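
In plain Python, the logic I want would look roughly like this (illustrative only, not Spark code; `shuffle_and_pick` is just a hypothetical helper name):

    import random

    # Illustrative sketch of the desired transformation on plain Python lists:
    # shuffle every inner list, shuffle the outer list, then take the first
    # element of each of the first 5 inner lists.
    def shuffle_and_pick(gm_array):
        inner = [random.sample(xs, len(xs)) for xs in gm_array]  # shuffled copies of the inner lists
        random.shuffle(inner)                                     # shuffle the outer list in place
        return [xs[0] for xs in inner[:5]]                        # first element of the first 5 inner lists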

Solution

  • Using Spark's built-in array and higher-order functions, you can do:

    import random
    from pyspark.sql import functions as F
    
    # example of input dataframe
    df = spark.createDataFrame(
        [
            ([[random.randint(1, 100) for _ in range(5)] for _ in range(6)],)
            for _ in range(4)
        ],
        ["gm_array"]
    )
    
    # first step: shuffle each inner array, then shuffle the outer array
    df_shuffled = df.withColumn(
        "gm_array",
        F.shuffle(F.transform("gm_array", lambda x: F.shuffle(x)))
    )
    
    # second step: take the first element of each of the first 5 inner arrays
    df_top_5 = df_shuffled.withColumn(
        "gm_array",
        F.transform(F.slice("gm_array", 1, 5), lambda x: x[0])
    )
    
    df_top_5.show(truncate=False)
    #+--------------------+
    #|gm_array            |
    #+--------------------+
    #|[77, 44, 6, 23, 100]|
    #|[40, 57, 10, 32, 27]|
    #|[3, 45, 17, 9, 9]   |
    #|[62, 39, 10, 95, 17]|
    #+--------------------+
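
  • Note that `F.transform` taking a Python lambda requires Spark 3.1+. On older versions (Spark 2.4+), a roughly equivalent sketch uses the same higher-order functions through SQL expressions with `F.expr` (assuming the same `gm_array` column):

    # shuffle each inner array, then the outer array, via SQL higher-order functions
    df_shuffled = df.withColumn(
        "gm_array",
        F.expr("shuffle(transform(gm_array, x -> shuffle(x)))")
    )

    # first element of each of the first 5 inner arrays
    df_top_5 = df_shuffled.withColumn(
        "gm_array",
        F.expr("transform(slice(gm_array, 1, 5), x -> x[0])")
    )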