Search code examples
pythonduckdb

"n_unique" aggregation using DuckDB relational API


Say I have

import duckdb

rel = duckdb.sql('select * from values (1, 4), (1, 5), (2, 6) df(a, b)')
rel
Out[3]: 
┌───────┬───────┐
│   a   │   b   │
│ int32 │ int32 │
├───────┼───────┤
│     1 │     4 │
│     1 │     5 │
│     2 │     6 │
└───────┴───────┘

I can group by a and find the mean of 'b' by doing:

rel.aggregate(
    [duckdb.FunctionExpression('mean', duckdb.ColumnExpression('b'))],
    group_expr='a',
)
┌─────────┐
│ mean(b) │
│ double  │
├─────────┤
│     4.5 │
│     6.0 │
└─────────┘

which works wonderfully

Is there a similar way to create a "n_unique" aggregation? I'm looking for something like

rel.aggregate(
    [duckdb.FunctionExpression('count_distinct', duckdb.ColumnExpression('b'))],
    group_expr='a',
)

but that doesn't exist. Is there something that does?


Solution

  • updated. I couldn't find proper way of doing count distinct, but you could use combination of array_agg() and array_unique() functions:

    rel.aggregate(
        [duckdb.FunctionExpression(
            'array_unique',
            duckdb.FunctionExpression(
                'array_agg',
                duckdb.ColumnExpression('b')
            )
        )],
        group_expr='a',
    )
    
    ┌────────────────────────────┐
    │ array_unique(array_agg(b)) │
    │           uint64           │
    ├────────────────────────────┤
    │                          1 │
    │                          2 │
    └────────────────────────────┘
    

    old. you can pre-select distinct a and b columns?

    (
        rel.select(*[duckdb.ColumnExpression('a'), duckdb.ColumnExpression('b')])
        .distinct()
        .aggregate(
            [duckdb.FunctionExpression('count', duckdb.ColumnExpression('b'))],
            group_expr='a',
        )
    )