Tags: sql, dataframe, pyspark, bigdata, data-processing

Subquery like SQL in PySpark


I'm trying to do this kind of query:

SELECT age, COUNT(age)
   FROM T
   GROUP BY age
   HAVING COUNT(age) = (SELECT MIN(cnt)
                        FROM (SELECT COUNT(age) AS cnt FROM T GROUP BY age) AS g)
   ORDER BY COUNT(age)

I tried

import pyspark.sql.functions as f

min_size = df.groupBy("age").count().select(f.min("count"))
df.groupBy("age").count().sort("count").filter(f.col("count") == min_size).show()

but I get AttributeError: 'DataFrame' object has no attribute '_get_object_id'

Is there any way to use subqueries in PySpark?


Solution

  • In your case, min_size is a DataFrame, not an integer, which is why the
    comparison inside filter raises that AttributeError. Collect it down to a
    plain Python int like this:

    min_size = df.groupBy("age").count().select(f.min("count")).collect()[0][0]
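
    Putting it together, the whole pipeline might look like this (a minimal
    sketch, assuming the df and the f alias for pyspark.sql.functions from
    the question):

    import pyspark.sql.functions as f

    # Count rows per age group
    counts = df.groupBy("age").count()

    # collect() returns a list of Rows; [0][0] extracts the plain Python int
    min_size = counts.select(f.min("count")).collect()[0][0]

    # Keep only the age groups whose size equals that minimum, sorted by count
    counts.filter(f.col("count") == min_size).sort("count").show()

    If you would rather write the subquery in SQL itself, you can register
    the DataFrame as a temporary view and run it through spark.sql (a sketch,
    assuming a SparkSession named spark and an arbitrary view name "T";
    uncorrelated scalar subqueries need Spark 2.0+):

    df.createOrReplaceTempView("T")
    spark.sql("""
        SELECT age, cnt
        FROM (SELECT age, COUNT(age) AS cnt FROM T GROUP BY age) AS g
        WHERE cnt = (SELECT MIN(cnt)
                     FROM (SELECT COUNT(age) AS cnt FROM T GROUP BY age) AS g2)
        ORDER BY cnt
    """).show()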