Search code examples
pythonpython-polars

How to add a new column that has the occurrence number of the value in a column?


Given a Polars Dataframe:

df = pl.DataFrame({'Col_1':['A','B','C','D'],'Col_2':['a','b','c','d'],'Col_3':[1,2,3,4]})
print(df)
shape: (4, 3)
┌───────┬───────┬───────┐
│ Col_1 ┆ Col_2 ┆ Col_3 │
│ ---   ┆ ---   ┆ ---   │
│ str   ┆ str   ┆ i64   │
╞═══════╪═══════╪═══════╡
│ A     ┆ a     ┆ 1     │
│ B     ┆ b     ┆ 2     │
│ C     ┆ c     ┆ 3     │
│ D     ┆ d     ┆ 4     │
└───────┴───────┴───────┘

and a list:

display_list = ['A','B','B','B','C','D','D','A']

I want the output to be in this format:

shape: (8, 4)
┌───────┬───────┬───────┬───────────────┐
│ Col_1 ┆ Col_2 ┆ Col_3 ┆ Occurrence_No │
│ ---   ┆ ---   ┆ ---   ┆ ---           │
│ str   ┆ str   ┆ i64   ┆ i64           │
╞═══════╪═══════╪═══════╪═══════════════╡
│ A     ┆ a     ┆ 1     ┆ 1             │
│ B     ┆ b     ┆ 2     ┆ 1             │
│ B     ┆ b     ┆ 2     ┆ 2             │
│ B     ┆ b     ┆ 2     ┆ 3             │
│ C     ┆ c     ┆ 3     ┆ 1             │
│ D     ┆ d     ┆ 4     ┆ 1             │
│ D     ┆ d     ┆ 4     ┆ 2             │
│ A     ┆ a     ┆ 1     ┆ 1             │
└───────┴───────┴───────┴───────────────┘

I want to the rows to be duplicated based on the number of times the first column appears in the list. Also, I need an Occurrence_No column that acts as the counter of the occurrence of the first column in the dataframe. The order of the rows in the DataFrame does not matter.

I am able to get the result except the Occurrence_No using this code:

df = df.with_columns(pl.col('Col_1').apply(lambda x: display_list.count(x)).alias('occur'))
df = df.select(pl.exclude('occur').repeat_by('occur').explode())

The above codes creates the number of rows based on the number of times it occurs in the list.

How do I add the Occurrence_No column?


Solution

  • You could just do a join of display_list (as a pl.Series -> pl.DataFrame). For Occurence_no, a cumcount over the rle_id.

    pl.Series(name='Col_1', values=display_list).to_frame().join(
        df, on='Col_1'
    ).with_columns(
        Occurence_No=1 + pl.col('Col_1').cumcount().over(pl.col('Col_1').rle_id())
    )