Tags: python, apache-spark, pyspark, apache-spark-sql, shuffle

Shuffle an array of arrays in PySpark columns


I have a PySpark column like this:

                   gm_array
[[1, 4, 6,...], [2, 7, 8,...], [3, 5, 7,...],...]
[[8, 11, 9,...], [7, 2, 6,...], [10, 9, 8,...],...]
[[90, 13, 67,...], [55, 6, 98,...], [1, 6, 2,...],...]
.
.

Now I want to shuffle the outer array as well as each array inside it, and then pick the first element from each of the first 5 inner arrays.

First output, the randomly shuffled arrays:

                  gm_array
[[19, 6, 1,...], [9, 80, 5,...], [30, 7, 3,...],...]
[[7, 9, 11,...], [6, 8, 7,...], [18, 7, 10,...],...]
[[90, 1, 7,...], [8, 9, 81,...], [6, 5, 1,...],...]
.
.

Second output, the first element of each of the first 5 arrays inside the main array:

[19, 9, 30,...]
[7, 6, 18,...]
[90, 8, 6,...]
.
.
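
In plain Python, the logic I want would look roughly like this (illustrative only, not Spark code; `shuffle_and_pick` is just a hypothetical helper name):

    import random

    # Illustrative sketch of the desired transformation on plain Python lists:
    # shuffle every inner list, shuffle the outer list, then take the first
    # element of each of the first 5 inner lists.
    def shuffle_and_pick(gm_array):
        inner = [random.sample(xs, len(xs)) for xs in gm_array]  # shuffled copies of the inner lists
        random.shuffle(inner)                                     # shuffle the outer list in place
        return [xs[0] for xs in inner[:5]]                        # first element of the first 5 inner lists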

Solution

  • Using Spark's built-in array and higher-order functions, you can do:

    import random
    from pyspark.sql import functions as F
    
    # example of input dataframe
    df = spark.createDataFrame(
        [
            ([[random.randint(1, 100) for _ in range(5)] for _ in range(6)],)
            for _ in range(4)
        ],
        ["gm_array"]
    )
    
    # first step: shuffle each inner array, then shuffle the outer array
    df_shuffled = df.withColumn(
        "gm_array",
        F.shuffle(F.transform("gm_array", lambda x: F.shuffle(x)))
    )
    
    # second step: take the first element of each of the first 5 inner arrays
    df_top_5 = df_shuffled.withColumn(
        "gm_array",
        F.transform(F.slice("gm_array", 1, 5), lambda x: x[0])
    )
    
    df_top_5.show(truncate=False)
    #+--------------------+
    #|gm_array            |
    #+--------------------+
    #|[77, 44, 6, 23, 100]|
    #|[40, 57, 10, 32, 27]|
    #|[3, 45, 17, 9, 9]   |
    #|[62, 39, 10, 95, 17]|
    #+--------------------+
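
  • Note that `F.transform` taking a Python lambda requires Spark 3.1+. On older versions (Spark 2.4+), a roughly equivalent sketch uses the same higher-order functions through SQL expressions with `F.expr` (assuming the same `gm_array` column):

    # shuffle each inner array, then the outer array, via SQL higher-order functions
    df_shuffled = df.withColumn(
        "gm_array",
        F.expr("shuffle(transform(gm_array, x -> shuffle(x)))")
    )

    # first element of each of the first 5 inner arrays
    df_top_5 = df_shuffled.withColumn(
        "gm_array",
        F.expr("transform(slice(gm_array, 1, 5), x -> x[0])")
    )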