Tags: python, dataframe, group-by, python-polars, sample

Resampling By Group in Polars


I'm trying to build a Monte Carlo simulator for my data in Polars. I want to group by a column, shuffle the groups, and then unpack the aggregated lists back into rows in their original within-group order. I have everything working up to the last step, where I'm stuck, and I'm beginning to think I've gone about this the wrong way.

import polars as pl

df_original = pl.DataFrame({
   'colA': ['A','A','B','B','C','C'],
   'colB': [11,12,13,14,15,16],
   'colC': [21,22,23,24,25,26]})
shape: (6, 3)
┌──────┬──────┬──────┐
│ colA ┆ colB ┆ colC │
│ ---  ┆ ---  ┆ ---  │
│ str  ┆ i64  ┆ i64  │
╞══════╪══════╪══════╡
│ A    ┆ 11   ┆ 21   │
│ A    ┆ 12   ┆ 22   │
│ B    ┆ 13   ┆ 23   │
│ B    ┆ 14   ┆ 24   │
│ C    ┆ 15   ┆ 25   │
│ C    ┆ 16   ┆ 26   │
└──────┴──────┴──────┘

I am grouping and resampling like this. Note that I am using a seed here so that the example is reproducible, but in practice this will be run many times without a seed.

df_resampled = (
   df_original
      .group_by('colA', maintain_order=True)
      .agg(pl.all())
      .sample(fraction=1.0, shuffle=True, seed=9)
)
shape: (3, 3)
┌──────┬───────────┬───────────┐
│ colA ┆ colB      ┆ colC      │
│ ---  ┆ ---       ┆ ---       │
│ str  ┆ list[i64] ┆ list[i64] │
╞══════╪═══════════╪═══════════╡
│ B    ┆ [13, 14]  ┆ [23, 24]  │
│ C    ┆ [15, 16]  ┆ [25, 26]  │
│ A    ┆ [11, 12]  ┆ [21, 22]  │
└──────┴───────────┴───────────┘

What I can't figure out is how to explode the lists and end up with the result below. The original order within each group is preserved; only the order of the groups themselves is reshuffled on each run.

shape: (6, 3)
┌──────┬──────┬──────┐
│ colA ┆ colB ┆ colC │
│ ---  ┆ ---  ┆ ---  │
│ str  ┆ i64  ┆ i64  │
╞══════╪══════╪══════╡
│ B    ┆ 13   ┆ 23   │
│ B    ┆ 14   ┆ 24   │
│ C    ┆ 15   ┆ 25   │
│ C    ┆ 16   ┆ 26   │
│ A    ┆ 11   ┆ 21   │
│ A    ┆ 12   ┆ 22   │
└──────┴──────┴──────┘

Solution

  • As @jqurious pointed out in the comments, this is solved by exploding every list column, that is, every column except the grouping key:

    df_resampled.explode(pl.exclude("colA"))