Tags: python, apache-spark, pyspark, apache-spark-sql

Get the max value from an array column and the value at the same index from another column in PySpark


I have a dataframe like this:

|   id      |   label   |  md  |
+-----------+-----------+------+
|[a, b, c]  | [1, 4, 2] |  3   |
|[b, d]     | [7, 2]    |  1   |
|[a, c]     | [1, 2]    |  8   |

I want to get the max value from the `label` column and the value at the same index from the `id` column.

Expected Output:

| id |label|  md  |
+----+-----+------+
| b  |  4  |  3   |
| b  |  7  |  1   |
| c  |  2  |  8   |

Solution

  • You can first combine the `id` and `label` columns using the `arrays_zip` and `inline` functions; after that, you can keep only the max value per group using a window function. This assumes the `md` column values are unique; if they are not, first create a unique grouping column (for example with `dense_rank`) and partition the window by that column instead.

    from pyspark.sql import Window, functions
    from pyspark.sql.functions import col

    # Explode the zipped arrays into one (id, label) row per element, keeping md.
    df = df.selectExpr("md", "inline(arrays_zip(id, label))")

    # Within each md partition, keep only the row holding the max label.
    w = Window.partitionBy("md")
    (df.withColumn("mx_label", functions.max("label").over(w))
       .filter(col("label") == col("mx_label"))
       .drop("mx_label")
       .show(truncate=False))