I would like to group by first_name
and for each fist_name
get the lowest age
.
My query which I run in online sql compiler works fine but when I try to use duckDB in python I get error that I try to return multiple columns, but this is exactly what I want:
duckdb.BinderException: Binder Error: Subquery returns 2 columns - expected 1
LINE 4: WHERE (first_name, age) IN (
My data:
customer_id first_name last_name age country
1 John Doe 31 USA
2 Robert Luna 22 USA
3 David Robinson 22 UK
4 John Reinhardt 25 UK
5 Betty Doe 28 UAE
sql query runs in online compiler works:
SELECT *
FROM customer
WHERE (first_name, age) IN (
SELECT first_name, min(age)
FROM customer
GROUP BY first_name
)
python script which ends with error
import pandas as pd
import duckdb
conn = duckdb.connect('mydb.db')
df = conn.execute("""
SELECT *
FROM 'customer.csv'
WHERE (first_name, age) IN (
SELECT first_name, min(age)
FROM 'customer.csv'
GROUP BY first_name
)
""").df()
print(df)
Is there a issue with duckDB? Is there some another way to write this query?
DuckDB doesn't appear to allow multi-column IN
. From the documentation:
The IN operator can also be used with a subquery that returns a single column. See the subqueries page for more information.
Instead of IN
, use JOIN
:
SELECT t1.*
FROM 'customer.csv' AS t1
JOIN (
SELECT first_name, min(age) AS minage
FROM 'customer.csv'
GROUP BY first_name
) AS t2 ON t1.first_name = t2.first_name AND t1.age = t2.minage