I have dataframe like this:
| id | label | md |
+-----------+-----------+------+
|[a, b, c] | [1, 4, 2] | 3 |
|[b, d] | [7, 2] | 1 |
|[a, c] | [1, 2] | 8 |
I want to get max value from label column and get value from id column with similar index.
Expected Output:
| id |label| md |
+----+-----+------+
| b | 4 | 3 |
| b | 7 | 1 |
| c | 2 | 8 |
You can first combine the "id" and "label" columns using arrays_zip
and inline
functions, after that you can filter the max value using a window function, (this assumes that the "md" column values are unique, if not create a unique column using dense_rank, and the window will be used over that column)
df = df.selectExpr("md", "inline(arrays_zip(id, label))")
w = Window.partitionBy("md")
df.withColumn("mx_label", functions.max("label").over(w)).filter(col("label") == col("mx_label")).drop("mx_label").show(truncate=False)