Tags: sql, dataframe, pyspark, bigdata, data-processing

Subquery like SQL in PySpark


I'm trying to do this kind of query:

SELECT age, COUNT(age)
   FROM T
   GROUP BY age
   HAVING COUNT(age) = (SELECT MIN(cnt)
                        FROM (SELECT COUNT(age) AS cnt FROM T GROUP BY age) AS g)
   ORDER BY COUNT(age)

I tried

import pyspark.sql.functions as f

min_size = df.groupBy("age").count().select(f.min("count"))
df.groupBy("age").count().sort("count").filter(f.col("count") == min_size).show()

but I get AttributeError: 'DataFrame' object has no attribute '_get_object_id'

Is there any way to use subqueries in PySpark?


Solution

  • In your case, min_size is a DataFrame, not an integer, which is why the
    comparison inside filter raises that AttributeError. Collect it down to a
    plain Python int like this:

    min_size = df.groupBy("age").count().select(f.min("count")).collect()[0][0]
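
    Putting it together, the whole pipeline might look like this (a minimal
    sketch, assuming the df and the f alias for pyspark.sql.functions from
    the question):

    import pyspark.sql.functions as f

    # Count rows per age group
    counts = df.groupBy("age").count()

    # collect() returns a list of Rows; [0][0] extracts the plain Python int
    min_size = counts.select(f.min("count")).collect()[0][0]

    # Keep only the age groups whose size equals that minimum, sorted by count
    counts.filter(f.col("count") == min_size).sort("count").show()

    If you would rather write the subquery in SQL itself, you can register
    the DataFrame as a temporary view and run it through spark.sql (a sketch,
    assuming a SparkSession named spark and an arbitrary view name "T";
    uncorrelated scalar subqueries need Spark 2.0+):

    df.createOrReplaceTempView("T")
    spark.sql("""
        SELECT age, cnt
        FROM (SELECT age, COUNT(age) AS cnt FROM T GROUP BY age) AS g
        WHERE cnt = (SELECT MIN(cnt)
                     FROM (SELECT COUNT(age) AS cnt FROM T GROUP BY age) AS g2)
        ORDER BY cnt
    """).show()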