I have a PySpark column like this:
gm_array
[[1, 4, 6,...], [2, 7, 8,...], [3, 5, 7,...],...]
[[8, 11, 9,...], [7, 2, 6,...], [10, 9, 8,...],...]
[[90, 13, 67,...], [55, 6, 98,...], [1, 6, 2,...],...]
.
.
Now I want to shuffle the outer array and also each array inside it, and then pick the first element from each of the first 5 inner arrays.
First output, the randomly shuffled array:
gm_array
[[19, 6, 1,...], [9, 80, 5,...], [30, 7, 3,...],...]
[[7, 9, 11,...], [6, 8, 7,...], [18, 7, 10,...],...]
[[90, 1, 7,...], [8, 9, 81,...], [6, 5, 1,...],...]
.
.
Second output, the first element of each of the first 5 arrays inside the main array:
[19, 9, 30,...]
[7, 6, 18,...]
[90, 8, 6,...]
.
.
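You can do this with the array functions shuffle and slice together with the higher-order function transform (the Python lambda form of F.transform requires Spark 3.1 or later):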
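import random

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# reuse the active session if there is one (e.g. in a notebook or the pyspark shell)
spark = SparkSession.builder.getOrCreate()

# example of input dataframe: 4 rows, each holding 6 inner arrays of 5 random ints
df = spark.createDataFrame(
    [
        ([[random.randint(1, 100) for _ in range(5)] for _ in range(6)],)
        for _ in range(4)
    ],
    ["gm_array"]
)

# first step: shuffle each inner array, then shuffle the outer array
df_shuffled = df.withColumn(
    "gm_array",
    F.shuffle(F.transform("gm_array", lambda x: F.shuffle(x)))
)

# second step: keep the first 5 inner arrays and take the first element of each
df_top_5 = df_shuffled.withColumn(
    "gm_array",
    F.transform(F.slice("gm_array", 1, 5), lambda x: x[0])
)

# values differ on every run because both the data and the shuffles are random
df_top_5.show(truncate=False)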
#+--------------------+
#|gm_array            |
#+--------------------+
#|[77, 44, 6, 23, 100]|
#|[40, 57, 10, 32, 27]|
#|[3, 45, 17, 9, 9]   |
#|[62, 39, 10, 95, 17]|
#+--------------------+
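
If you want to sanity-check the logic without a Spark session, the same two steps can be mirrored in plain Python on a single row. This is only a rough sketch of the semantics, not how Spark evaluates the column: the helper names shuffle_nested and first_of_first_n are made up for illustration, and the fixed seed is there just to make the run repeatable.

```python
import random


def shuffle_nested(rows, rng):
    """Return a copy of rows with every inner list shuffled, then the outer list shuffled."""
    # mirrors transform(gm_array, x -> shuffle(x))
    shuffled = [rng.sample(inner, len(inner)) for inner in rows]
    # mirrors the outer shuffle(...)
    rng.shuffle(shuffled)
    return shuffled


def first_of_first_n(rows, n=5):
    """Keep the first n inner lists and take the first element of each."""
    # mirrors transform(slice(gm_array, 1, n), x -> x[0])
    return [inner[0] for inner in rows[:n]]


rng = random.Random(42)  # assumption: fixed seed, only so the sketch is repeatable

# one "row" of the column: 6 inner lists of 5 random ints
gm_array = [[rng.randint(1, 100) for _ in range(5)] for _ in range(6)]

shuffled = shuffle_nested(gm_array, rng)
picked = first_of_first_n(shuffled)

print(shuffled)
print(picked)
```

Note that Spark's slice is 1-based (start at 1, length 5), while the Python slice rows[:n] is 0-based. Also, an empty inner list would raise IndexError in this sketch.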