python python-3.x apache-spark pyspark yelp

Sorting by count using pyspark

I'm trying to print the top 11 states, the biggest city in each state and the business count for each state. For some reason I am having trouble printing the business count for the state, only getting the count for the city.

Here is the code I am having trouble with

dun=df_busSelected.groupBy("state","city").count().orderBy("count",ascending=False).limit(11).show(truncate=False)

 +-----+----------+-----+ 
|state|city |count| 
+-----+----------+-----+ 
|NV |Las Vegas |29361| 
|ON |Toronto |18904| 
|AZ |Phoenix |18764| 
|NC |Charlotte |9507 | 
|AZ |Scottsdale|8837 | 
|AB |Calgary |7735 | 
|PA |Pittsburgh|7016 | 
|QC |Montréal |6449 | 
|AZ |Mesa |6080 | 
|NV |Henderson |4892 | 
|AZ |Tempe |4550 | 
+-----+----------+-----+

Solution

If I understand correctly what you need to do:

from pyspark.sql.functions import *
df_busSelected = spark.createDataFrame([("NV", "Las Vegas",29361),("ON", "Toronto" ,18904),("AZ", "Phoenix",18764),("NC", "Charlotte",9507),("AZ", "Scottsdale",8837),("AB", "Calgary",7735),("PA", "Pittsburgh",7016),("QC", "Montréal",6449),("AZ", "Mesa",6080),("NV", "Henderson",4892),("AZ", "Tempe",4550)]).toDF("state", "city", "count")

df_busSelected.withColumn("city_total_business", struct(col("count"), col("city")))\
     .groupBy("state")\
     .agg(sort_array(collect_set(col("city_total_business")), False)[0].name("top_city"))\
     .withColumn("city", col("top_city").getItem("city"))\
     .withColumn("count", col("top_city").getItem("count"))\
     .drop("top_city")\
     .show()

Prints out

+-----+----------+-----+
|state|      city|count|
+-----+----------+-----+
|   AZ|   Phoenix|18764|
|   QC|  Montréal| 6449|
|   NV| Las Vegas|29361|
|   NC| Charlotte| 9507|
|   PA|Pittsburgh| 7016|
|   ON|   Toronto|18904|
|   AB|   Calgary| 7735|
+-----+----------+-----+

This returns the city with the highest count for each state. It's easy now to sort and do what you want with them.

Please rate my answer if you like it.