Search code examples
python-polars

How to pivot a lazy dataframe based on the years of a date column?


Sorry for the title but basically I have this input :

data = {
    'text': ['text A', 'text B', 'text A', 'text A', 'text B', 'text C'],
    'length': [7, 1, 4, 0, 2, 0],
    'date': [20210101, 20231003, 20220808, 20210616, 20210203, 20220419]
 }

ldf = pl.LazyFrame(data)

print(ldf.collect())

shape: (6, 3)
┌────────┬────────┬──────────┐
│ text   ┆ length ┆ date     │
│ ---    ┆ ---    ┆ ---      │
│ str    ┆ i64    ┆ i64      │
╞════════╪════════╪══════════╡
│ text A ┆ 7      ┆ 20210101 │
│ text B ┆ 1      ┆ 20231003 │
│ text A ┆ 4      ┆ 20220808 │
│ text A ┆ 0      ┆ 20210616 │
│ text B ┆ 2      ┆ 20210203 │
│ text C ┆ 0      ┆ 20220419 │
└────────┴────────┴──────────┘

I tried with the code below but there is two problems. the years are written manually and there is error :

query = (
    ldf
        .group_by([pl.col('date').str.slice(0, 4).alias('year'), pl.col('text')]).agg(
            pl.col('length').sum().where(pl.col('year') == '2021').alias('2021'),
            pl.col('length').sum().where(pl.col('year') == '2022').alias('2022'),
            pl.col('length').sum().where(pl.col('year') == '2023').alias('2023'),
        )
)

print(query.collect())
ComputeError: column 'year' not available in schema Schema:
name: text, data type: Utf8
name: length, data type: Int64
name: date, data type: Int64

My expected output is this :

 grp_texts  2021  2022  2023
    text A     7     4     0
    text B     2     0     1
    text C     0     0     0

Solution

  • Polars does have a dedicated .pivot() method.

    However, as a pivot operation needs to read all of the data - it doesn't exist in the LazyFrame namespace.

    If it's acceptable for your use-case, you can .collect() and then .pivot()

    (ldf.with_columns(year = pl.col("date").str.slice(0, 4))
        .sort("year")
        .collect()
        .pivot(
           values = "length", 
           index = "text", 
           columns = "year", 
           aggregate_function = "sum"
        )
    )   
    
    shape: (3, 4)
    ┌────────┬──────┬──────┬──────┐
    │ text   ┆ 2021 ┆ 2022 ┆ 2023 │
    │ ---    ┆ ---  ┆ ---  ┆ ---  │
    │ str    ┆ i64  ┆ i64  ┆ i64  │
    ╞════════╪══════╪══════╪══════╡
    │ text A ┆ 7    ┆ 4    ┆ null │
    │ text B ┆ 2    ┆ null ┆ 1    │
    │ text C ┆ null ┆ 0    ┆ null │
    └────────┴──────┴──────┴──────┘