Search code examples
python-polars

Convert Polars DataFrame to use date labels for columns?


I am interested in converting some of the Pandas codebase I work in to Polars. In that codebase, we frequently use Pandas DataFrames which look as follows, where the rows represent an amount for a user each month:

              2023-09-01  2023-10-01  ...  2024-09-01  2024-10-01
customer_id                                                                                                                                                                         
111111          0.000000    0.000000  ...  918.333333  918.333333
222222        466.666667  883.333333  ...    0.000000    0.000000
333333          0.000000  833.333333  ...  833.333333    0.000000
444444          0.000000    0.000000  ...  833.333333  833.333333

I currently have a Polars DataFrame which contains similar information, just encoded in rows rather than columns:

shape: (313_590, 3)
┌─────────────┬─────────────────────┬────────────┐
│ customer_id ┆ date                ┆ amount     │
│ ---         ┆ ---                 ┆ ---        │
│ i64         ┆ datetime[μs]        ┆ f64        │
╞═════════════╪═════════════════════╪════════════╡
│ 111111      ┆ 2023-01-01 00:00:00 ┆ 80.749008  │
│ 222222      ┆ 2023-06-01 00:00:00 ┆ 87.628968  │
│ 333333      ┆ 2023-02-01 00:00:00 ┆ 180.327381 │
│ 333333      ┆ 2022-06-01 00:00:00 ┆ 180.327381 │
│ …           ┆ …                   ┆ …          │
│ 555555      ┆ 2022-05-01 00:00:00 ┆ 85.818452  │
│ 666666      ┆ 2022-06-01 00:00:00 ┆ 85.818452  │
│ 777777      ┆ 2023-11-01 00:00:00 ┆ 87.628968  │
│ 888888      ┆ 2023-12-01 00:00:00 ┆ 87.628968  │
└─────────────┴─────────────────────┴────────────┘

In this example, customer 111111 might have 12 rows, one for each month of data which exists for them. Is there a way in Polars to convert this DataFrame into the one more like above, where column labels can just be the ISO format string of the date?

Thanks in advance!


Solution

  • You can do this with

    (
    df
    .sort('date')
    .with_columns(pl.col('date').dt.strftime("%Y-%m-%d"))
    .pivot('amount', 'customer_id','date','first')
    .with_columns(pl.col(pl.Float64).fill_null(0))
    )
    shape: (7, 8)
    ┌────────────┬────────────┬────────────┬───────────┬───────────┬───────────┬───────────┬───────────┐
    │ customer_i ┆ 2022-05-01 ┆ 2022-06-01 ┆ 2023-01-0 ┆ 2023-02-0 ┆ 2023-06-0 ┆ 2023-11-0 ┆ 2023-12-0 │
    │ d          ┆ ---        ┆ ---        ┆ 1         ┆ 1         ┆ 1         ┆ 1         ┆ 1         │
    │ ---        ┆ f64        ┆ f64        ┆ ---       ┆ ---       ┆ ---       ┆ ---       ┆ ---       │
    │ i64        ┆            ┆            ┆ f64       ┆ f64       ┆ f64       ┆ f64       ┆ f64       │
    ╞════════════╪════════════╪════════════╪═══════════╪═══════════╪═══════════╪═══════════╪═══════════╡
    │ 555555     ┆ 85.818452  ┆ 0.0        ┆ 0.0       ┆ 0.0       ┆ 0.0       ┆ 0.0       ┆ 0.0       │
    │ 333333     ┆ 0.0        ┆ 180.327381 ┆ 0.0       ┆ 180.32738 ┆ 0.0       ┆ 0.0       ┆ 0.0       │
    │            ┆            ┆            ┆           ┆ 1         ┆           ┆           ┆           │
    │ 666666     ┆ 0.0        ┆ 85.818452  ┆ 0.0       ┆ 0.0       ┆ 0.0       ┆ 0.0       ┆ 0.0       │
    │ 111111     ┆ 0.0        ┆ 0.0        ┆ 80.749008 ┆ 0.0       ┆ 0.0       ┆ 0.0       ┆ 0.0       │
    │ 222222     ┆ 0.0        ┆ 0.0        ┆ 0.0       ┆ 0.0       ┆ 87.628968 ┆ 0.0       ┆ 0.0       │
    │ 777777     ┆ 0.0        ┆ 0.0        ┆ 0.0       ┆ 0.0       ┆ 0.0       ┆ 87.628968 ┆ 0.0       │
    │ 888888     ┆ 0.0        ┆ 0.0        ┆ 0.0       ┆ 0.0       ┆ 0.0       ┆ 0.0       ┆ 87.628968 │
    └────────────┴────────────┴────────────┴───────────┴───────────┴───────────┴───────────┴───────────┘
    

    One note, pivots are great to help display data but they're usually not the best first step if you've got calculations to do. For instance, if you wanted the sum across customers per month it might seem like you need the pivot first but it'd be better to just do a group_by on the date.