Say I have
import duckdb
rel = duckdb.sql('select * from values (1, 4), (1, 5), (2, 6) df(a, b)')
rel
Out[3]:
┌───────┬───────┐
│ a │ b │
│ int32 │ int32 │
├───────┼───────┤
│ 1 │ 4 │
│ 1 │ 5 │
│ 2 │ 6 │
└───────┴───────┘
I can group by a and find the mean of 'b' by doing:
rel.aggregate(
[duckdb.FunctionExpression('mean', duckdb.ColumnExpression('b'))],
group_expr='a',
)
┌─────────┐
│ mean(b) │
│ double │
├─────────┤
│ 4.5 │
│ 6.0 │
└─────────┘
which works wonderfully
Is there a similar way to create a "n_unique" aggregation? I'm looking for something like
rel.aggregate(
[duckdb.FunctionExpression('count_distinct', duckdb.ColumnExpression('b'))],
group_expr='a',
)
but that doesn't exist. Is there something that does?
updated.
I couldn't find proper way of doing count distinct
, but you could use combination of array_agg()
and array_unique()
functions:
rel.aggregate(
[duckdb.FunctionExpression(
'array_unique',
duckdb.FunctionExpression(
'array_agg',
duckdb.ColumnExpression('b')
)
)],
group_expr='a',
)
┌────────────────────────────┐
│ array_unique(array_agg(b)) │
│ uint64 │
├────────────────────────────┤
│ 1 │
│ 2 │
└────────────────────────────┘
old.
you can pre-select distinct a
and b
columns?
(
rel.select(*[duckdb.ColumnExpression('a'), duckdb.ColumnExpression('b')])
.distinct()
.aggregate(
[duckdb.FunctionExpression('count', duckdb.ColumnExpression('b'))],
group_expr='a',
)
)