Search code examples
pythonpython-polars

python polars: df partition with pivot and concat


my goal was to groupby/partition by one column (a below), create a string identifier (b and c columns) then use this b_c identifier as a name for a column in a pivoted data frame. Code below works OK as far as I can tell, but the path to get the result is a bit twisted. So my question is: can this be done in a simpler way? BTW, at this tiny scale (max 1k of rows so far) I am not obsessed to make it faster.

data = {
    "a": [1, 1, 1, 2, 2, 3],
    "b": [11, 12, 13, 11, 12, 11],
    "c": ["x1", "x2", "x3", "x1", "x2", "x1"],
    "val": [101, 102, 102, 201, 202, 301],
}
df = pl.DataFrame(data)

print(df)

counter = 0
for tmp_df in df.partition_by("a"):
    grp_df = (
        tmp_df.with_columns((pl.col("b").cast(pl.String) + "_" + pl.col("c")).alias("col_id"))
        .drop("b", "c")
        .pivot("col_id", index="a")
    )

    if counter == 0:
        result_df = grp_df.select(pl.all())
    else:
        result_df = pl.concat([result_df, grp_df], how="diagonal")
    counter += 1

print(result_df)

Output:

shape: (3, 4)
┌─────┬───────┬───────┬───────┐
│ a   ┆ 11_x1 ┆ 12_x2 ┆ 13_x3 │
│ --- ┆ ---   ┆ ---   ┆ ---   │
│ i64 ┆ i64   ┆ i64   ┆ i64   │
╞═════╪═══════╪═══════╪═══════╡
│ 1   ┆ 101   ┆ 102   ┆ 102   │
│ 2   ┆ 201   ┆ 202   ┆ null  │
│ 3   ┆ 301   ┆ null  ┆ null  │
└─────┴───────┴───────┴───────┘

Solution

  • you can do this in 2 step: first a select step to create the new id column, then the pivot.

    Example 1:

    (
        df.select(
            'a','val',
            id = pl.col('b').cast(pl.String) + '_' + pl.col('c'))
        .pivot('id', index='a')
    )
    
    # Result
    shape: (3, 4)
    ┌─────┬───────┬───────┬───────┐
    │ a   ┆ 11_x1 ┆ 12_x2 ┆ 13_x3 │
    │ --- ┆ ---   ┆ ---   ┆ ---   │
    │ i64 ┆ i64   ┆ i64   ┆ i64   │
    ╞═════╪═══════╪═══════╪═══════╡
    │ 1   ┆ 101   ┆ 102   ┆ 102   │
    │ 2   ┆ 201   ┆ 202   ┆ null  │
    │ 3   ┆ 301   ┆ null  ┆ null  │
    └─────┴───────┴───────┴───────┘
    

    Example 2: (suggested by @jqurious), using pl.format

    (
        df.select(
            'a','val',
            id = pl.format("{}_{}", "b", "c"))
        .pivot('id', index='a')
    )