Working with a dataframe of the type:
df = pl.DataFrame({
'order': [38681.0, 38692.0, 38680.0, 38693.0],
'shipto': ["471433", "239269", "471433","239269"],
'value': [10,20,30,None]
})
Need to Groupby the 'Shipto' column, sum the 'values' and aggregate the 'order' to a list. Have tried a few things but can't get it work.
The base script:
df = (df
.with_columns([
pl.col('order').cast(pl.Utf8),
pl.col(pl.Float64).fill_null(0)
])
.groupby('shipto')
.agg([
pl.col('order').apply(lambda x: str(x)).alias('order_list'),
pl.sum('value')
])
)
Returns:
shipto | order_list | value |
---|---|---|
str | str | i64 |
471433 | shape: (2,) | 40 |
Series: '' [f64] | ||
[ | ||
... | ||
239269 | shape: (2,) | 20 |
Series: '' [f64] | ||
[ | ||
... |
What I am hoping to get in the 'order_list' column is either ([38681.0,38680.0],[38692.0,38693.0]) or (['38681.0','38680.0'],['38692.0','38693.0'])
I am guessing the 'order' column needs to be converted from f64 values to string (Utf8), but can't get that that to work.
Variations to the line 'pl.col('order').cast(pl.Utf8), #.cast(pl.Float64)' that I have tried so far:
pl.col('order').cast(pl.Float64).cast(pl.Utf8),
pl.col('order').cast(pl.Int64).cast(pl.Utf8),
pl.col('order').map(lambda x: str(x)),
pl.col('order').apply(lambda x: str(int(x))),
pl
.when(pl.col('order').is_null())
.then(pl.lit(''))
.otherwise(pl.col('order').cast(pl.Float64).cast(pl.Utf8)).alias('order'),
Sure there is some basic error, but continuing to struggle with this and any help would be greatly appreciated.
if you do not specify an aggregation function inside of agg, the result will be a list.
This might be what you are looking for.
df.group_by('shipto').agg(
'order',
pl.sum('value')
)
# Result
shape: (2, 3)
┌────────┬────────────────────┬───────┐
│ shipto ┆ order ┆ value │
│ --- ┆ --- ┆ --- │
│ str ┆ list[f64] ┆ i64 │
╞════════╪════════════════════╪═══════╡
│ 471433 ┆ [38681.0, 38680.0] ┆ 40 │
│ 239269 ┆ [38692.0, 38693.0] ┆ 20 │
└────────┴────────────────────┴───────┘