I'm trying to print the top 11 states, the biggest city in each state and the business count for each state. For some reason I am having trouble printing the business count for the state, only getting the count for the city.
Here is the code I am having trouble with
dun=df_busSelected.groupBy("state","city").count().orderBy("count",ascending=False).limit(11).show(truncate=False)
+-----+----------+-----+
|state|city |count|
+-----+----------+-----+
|NV |Las Vegas |29361|
|ON |Toronto |18904|
|AZ |Phoenix |18764|
|NC |Charlotte |9507 |
|AZ |Scottsdale|8837 |
|AB |Calgary |7735 |
|PA |Pittsburgh|7016 |
|QC |Montréal |6449 |
|AZ |Mesa |6080 |
|NV |Henderson |4892 |
|AZ |Tempe |4550 |
+-----+----------+-----+
If I understand correctly what you need to do:
from pyspark.sql.functions import *
df_busSelected = spark.createDataFrame([("NV", "Las Vegas",29361),("ON", "Toronto" ,18904),("AZ", "Phoenix",18764),("NC", "Charlotte",9507),("AZ", "Scottsdale",8837),("AB", "Calgary",7735),("PA", "Pittsburgh",7016),("QC", "Montréal",6449),("AZ", "Mesa",6080),("NV", "Henderson",4892),("AZ", "Tempe",4550)]).toDF("state", "city", "count")
df_busSelected.withColumn("city_total_business", struct(col("count"), col("city")))\
.groupBy("state")\
.agg(sort_array(collect_set(col("city_total_business")), False)[0].name("top_city"))\
.withColumn("city", col("top_city").getItem("city"))\
.withColumn("count", col("top_city").getItem("count"))\
.drop("top_city")\
.show()
Prints out
+-----+----------+-----+
|state| city|count|
+-----+----------+-----+
| AZ| Phoenix|18764|
| QC| Montréal| 6449|
| NV| Las Vegas|29361|
| NC| Charlotte| 9507|
| PA|Pittsburgh| 7016|
| ON| Toronto|18904|
| AB| Calgary| 7735|
+-----+----------+-----+
This returns the city with the highest count for each state. It's easy now to sort and do what you want with them.
Please rate my answer if you like it.