Search code examples
pythonsqlduckdb

Subquery returning multiple columns in duckDB


I would like to group by first_name and for each fist_name get the lowest age.
My query which I run in online sql compiler works fine but when I try to use duckDB in python I get error that I try to return multiple columns, but this is exactly what I want:

duckdb.BinderException: Binder Error: Subquery returns 2 columns - expected 1
LINE 4: WHERE (first_name, age) IN (

My data:

customer_id  first_name  last_name  age  country
1            John        Doe        31   USA
2            Robert      Luna       22   USA
3            David       Robinson   22   UK
4            John        Reinhardt  25   UK
5            Betty       Doe        28   UAE

sql query runs in online compiler works:

SELECT * 
FROM customer
WHERE (first_name, age) IN (
  SELECT first_name, min(age)
  FROM customer
  GROUP BY first_name
)

python script which ends with error

import pandas as pd
import duckdb


conn = duckdb.connect('mydb.db')
df = conn.execute("""
SELECT * 
FROM 'customer.csv'
WHERE (first_name, age) IN (
  SELECT first_name, min(age)
  FROM 'customer.csv'
  GROUP BY first_name
)
""").df()

print(df)

Is there a issue with duckDB? Is there some another way to write this query?


Solution

  • DuckDB doesn't appear to allow multi-column IN. From the documentation:

    The IN operator can also be used with a subquery that returns a single column. See the subqueries page for more information.

    Instead of IN, use JOIN:

    SELECT t1.*
    FROM 'customer.csv' AS t1
    JOIN (
      SELECT first_name, min(age) AS minage
      FROM 'customer.csv'
      GROUP BY first_name
    ) AS t2 ON t1.first_name = t2.first_name AND t1.age = t2.minage