Sorry for the title but basically I have this input :
data = {
'text': ['text A', 'text B', 'text A', 'text A', 'text B', 'text C'],
'length': [7, 1, 4, 0, 2, 0],
'date': [20210101, 20231003, 20220808, 20210616, 20210203, 20220419]
}
ldf = pl.LazyFrame(data)
print(ldf.collect())
shape: (6, 3)
┌────────┬────────┬──────────┐
│ text ┆ length ┆ date │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞════════╪════════╪══════════╡
│ text A ┆ 7 ┆ 20210101 │
│ text B ┆ 1 ┆ 20231003 │
│ text A ┆ 4 ┆ 20220808 │
│ text A ┆ 0 ┆ 20210616 │
│ text B ┆ 2 ┆ 20210203 │
│ text C ┆ 0 ┆ 20220419 │
└────────┴────────┴──────────┘
I tried with the code below but there is two problems. the years are written manually and there is error :
query = (
ldf
.group_by([pl.col('date').str.slice(0, 4).alias('year'), pl.col('text')]).agg(
pl.col('length').sum().where(pl.col('year') == '2021').alias('2021'),
pl.col('length').sum().where(pl.col('year') == '2022').alias('2022'),
pl.col('length').sum().where(pl.col('year') == '2023').alias('2023'),
)
)
print(query.collect())
ComputeError: column 'year' not available in schema Schema:
name: text, data type: Utf8
name: length, data type: Int64
name: date, data type: Int64
My expected output is this :
grp_texts 2021 2022 2023
text A 7 4 0
text B 2 0 1
text C 0 0 0
Polars does have a dedicated .pivot()
method.
However, as a pivot operation needs to read all of the data - it doesn't exist in the LazyFrame
namespace.
If it's acceptable for your use-case, you can .collect()
and then .pivot()
(ldf.with_columns(year = pl.col("date").str.slice(0, 4))
.sort("year")
.collect()
.pivot(
values = "length",
index = "text",
columns = "year",
aggregate_function = "sum"
)
)
shape: (3, 4)
┌────────┬──────┬──────┬──────┐
│ text ┆ 2021 ┆ 2022 ┆ 2023 │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 │
╞════════╪══════╪══════╪══════╡
│ text A ┆ 7 ┆ 4 ┆ null │
│ text B ┆ 2 ┆ null ┆ 1 │
│ text C ┆ null ┆ 0 ┆ null │
└────────┴──────┴──────┴──────┘