I'm trying to do this kind of query:
SELECT age,COUNT(age)
FROM T
GROUP BY age
HAVING COUNT(age) = (SELECT MIN(cnt) FROM (SELECT COUNT(age) AS cnt FROM T GROUP BY age) AS counts)
ORDER BY COUNT(age)
I tried:
min_size = df.groupBy("age").count().select(f.min("count"))
df.groupBy("age").count().sort("count").filter(f.col("count")==min_size).show()
but I get AttributeError: 'DataFrame' object has no attribute '_get_object_id'
Is there any way to use subqueries in PySpark?
In your case, min_size is a DataFrame, not an integer, so it cannot be compared with a column inside filter().
Collect it into a Python integer like this:
min_size = df.groupBy("age").count().select(f.min("count")).collect()[0][0]