
Create unique id column for each pair of (col_x, col_y) in polars Python


I have a polars dataframe with subject_id, timestamp, event, col1, and col2 columns.

I want to split this dataframe into two Polars dataframes (one with subject_id, timestamp, event and one with subject_id, timestamp, col1, col2). Before splitting, I want to add a unique id column so that I can use that id to join the two dataframes back together after grouping/manipulating them separately.

How can I create this column in Polars so that every distinct (subject_id, timestamp) pair in the dataframe gets its own id before splitting?

Essentially, I wish to do what this post provided, but in Polars. I understand Polars does not have indexes, so what is the best approach?


Solution

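  • It looks like I just had to do a bit more digging: it helps to find a solution in pandas first and then replicate it in Polars. The answer below is from this post. It groups on "id" and "cat"; for the question above, the grouping columns would be "subject_id" and "timestamp", and the resulting "ngroup" column is the unique id.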

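    import polars as pl

    # df is an existing DataFrame with "id" and "cat" columns.
    (
        # Add a row index column (named "index" by default).
        df.with_row_index()
        # Group on the id and cat columns.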
        .group_by(
            ["id", "cat"],
            maintain_order=True,
        )
        .agg(
            # Create a list of all index positions per group.
            pl.col("index")
        )
        # Number the groups: one "ngroup" value per distinct (id, cat) pair.
        .with_row_index("ngroup")
        # Expand index list column to separate rows.
        .explode("index")
        # Reorder columns.
        .select("index", "ngroup", "id", "cat")
        # Optionally sort by original order.
        .sort("index")
    )