I understand collect_set can have a random order. Is there a different method of ordering a collect_set by count? I want to have an array of the most popular items for a single column based on a group by of a separate id column. Would you have a collect_list and then run the count on that?
No, there is no method to order a collect_set by count: the collect aggregate functions do not count items, so the information needed to sort them is simply not available.
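For illustration, the plain aggregation from the question (a sketch, assuming a dataframe with columns id and item, as described) only collects distinct items, in an arbitrary order:

from pyspark.sql import functions as F

# Plain collect_set: distinct items per id, in no guaranteed order,
# and with no count left over to sort by.
naive_df = dataframe.groupBy('id').agg(F.collect_set('item').alias('items'))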
So, since Spark 3.1 and greater, given a dataframe with two columns id and item, you can:

- count over a groupBy on columns id and item
- collect the (count, item) couples to an array with collect_list and struct. Note: you could use collect_set here instead of collect_list, but it is useless, as we are sure each (count, item) couple is unique
- sort_array to sort the array by descending count
- transform to drop count

Which can be translated to code as follows:
from pyspark.sql import functions as F
final_df = dataframe.groupBy('id', 'item').count() \
    .groupBy('id') \
    .agg(
        F.transform(
            F.sort_array(
                F.collect_list(F.struct("count", "item")),
                asc=False
            ),
            lambda x: x.getItem('item')
        ).alias('popular_items')
    )
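To make the two-stage aggregation easier to follow, here is a sketch of the intermediate dataframe produced by the first groupBy (counts taken from the example input shown further down; counts_df is just an illustrative name):

# Stage 1 on its own: one row per (id, item) couple with its count,
# e.g. (1, 'item2', 3), (1, 'item3', 1), (2, 'item3', 2), ...
counts_df = dataframe.groupBy('id', 'item').count()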
Note: if your Spark version is lower than 3.1 but greater than 1.6, you can replace transform with withColumn as follows:
from pyspark.sql import functions as F
final_df = dataframe.groupBy('id', 'item').count() \
    .groupBy('id') \
    .agg(F.sort_array(F.collect_list(F.struct("count", "item")), asc=False).alias('popular_items')) \
    .withColumn("popular_items", F.col('popular_items.item'))
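This works because selecting popular_items.item on a column that is an array of structs returns an array built from that field of each struct, so the counts are dropped just like with transform.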
With the following input dataframe:
+---+-----+
|id |item |
+---+-----+
|1 |item1|
|1 |item2|
|1 |item2|
|1 |item2|
|1 |item3|
|2 |item3|
|2 |item3|
|2 |item1|
|3 |item1|
|3 |item1|
+---+-----+
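To reproduce end to end, a minimal sketch, assuming a local SparkSession, that builds this input dataframe and displays the result:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
dataframe = spark.createDataFrame(
    [(1, 'item1'), (1, 'item2'), (1, 'item2'), (1, 'item2'), (1, 'item3'),
     (2, 'item3'), (2, 'item3'), (2, 'item1'), (3, 'item1'), (3, 'item1')],
    ['id', 'item'],
)
# ...run the aggregation above to get final_df, then:
final_df.show(truncate=False)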
You get the following output:
+---+---------------------+
|id |popular_items |
+---+---------------------+
|1 |[item2, item3, item1]|
|3 |[item1] |
|2 |[item3, item1] |
+---+---------------------+