Reference polars.DataFrame.height in with_columns

Take this example:

import datetime

import numpy
import polars

df = (polars
  .DataFrame(dict(
    j=polars.datetime_range(datetime.date(2023, 1, 1), datetime.date(2023, 1, 3), '8h', closed='left', eager=True),
    ))
  .with_columns(
    k=polars.lit(numpy.random.randint(10, 99, 6)),
    )
  )

 j                    k
 2023-01-01 00:00:00  47
 2023-01-01 08:00:00  22
 2023-01-01 16:00:00  82
 2023-01-02 00:00:00  19
 2023-01-02 08:00:00  85
 2023-01-02 16:00:00  15
shape: (6, 2)

Here, numpy.random.randint(10, 99, 6) hard-codes 6 as the height of the DataFrame, so it breaks if I change, e.g., the interval from 8h to 4h (which would require changing 6 to 12).

I know I can do it by breaking the chain:

df = polars.DataFrame(dict(
  j=polars.datetime_range(datetime.date(2023, 1, 1), datetime.date(2023, 1, 3), '4h', closed='left', eager=True),
  ))

df = df.with_columns(
  k=polars.lit(numpy.random.randint(10, 99, df.height)),
  )

 j                    k
 2023-01-01 00:00:00  47
 2023-01-01 04:00:00  22
 2023-01-01 08:00:00  82
 2023-01-01 12:00:00  19
 2023-01-01 16:00:00  85
 2023-01-01 20:00:00  15
 2023-01-02 00:00:00  89
 2023-01-02 04:00:00  74
 2023-01-02 08:00:00  26
 2023-01-02 12:00:00  11
 2023-01-02 16:00:00  86
 2023-01-02 20:00:00  81
shape: (12, 2)

Is there a way to do it (i.e. reference df.height or an equivalent) in one chained expression though?

Solution

You can use .pipe(), which passes the DataFrame to a function, so df.height is available inside the lambda:

import datetime

import numpy as np
import polars as pl

df = (
   pl.datetime_range(
      datetime.date(2023, 1, 1),
      datetime.date(2023, 1, 3),
      "4h",
      closed="left",
      eager=True
   )
   .alias("date")
   .to_frame()
)

df.pipe(lambda df: 
    df.with_columns(pl.lit(np.random.randint(10, 99, df.height)).alias("rand"))
)

shape: (12, 2)
┌─────────────────────┬──────┐
│ date                ┆ rand │
│ ---                 ┆ ---  │
│ datetime[μs]        ┆ i64  │
╞═════════════════════╪══════╡
│ 2023-01-01 00:00:00 ┆ 39   │
│ 2023-01-01 04:00:00 ┆ 45   │
│ 2023-01-01 08:00:00 ┆ 95   │
│ 2023-01-01 12:00:00 ┆ 72   │
│ …                   ┆ …    │
│ 2023-01-02 08:00:00 ┆ 34   │
│ 2023-01-02 12:00:00 ┆ 42   │
│ 2023-01-02 16:00:00 ┆ 30   │
│ 2023-01-02 20:00:00 ┆ 83   │
└─────────────────────┴──────┘

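For the example task itself, .sample() may also work, with pl.len() supplying the row count. Note that pl.int_range(10, 100) draws from 10 to 99 inclusive, whereas numpy.random.randint(10, 99, ...) draws from 10 to 98.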

df.with_columns(
   pl.int_range(10, 100).sample(pl.len(), with_replacement=True).alias("rand")
)

shape: (12, 2)
┌─────────────────────┬──────┐
│ date                ┆ rand │
│ ---                 ┆ ---  │
│ datetime[μs]        ┆ i64  │
╞═════════════════════╪══════╡
│ 2023-01-01 00:00:00 ┆ 25   │
│ 2023-01-01 04:00:00 ┆ 27   │
│ 2023-01-01 08:00:00 ┆ 68   │
│ 2023-01-01 12:00:00 ┆ 95   │
│ 2023-01-01 16:00:00 ┆ 96   │
│ …                   ┆ …    │
│ 2023-01-02 04:00:00 ┆ 36   │
│ 2023-01-02 08:00:00 ┆ 25   │
│ 2023-01-02 12:00:00 ┆ 90   │
│ 2023-01-02 16:00:00 ┆ 92   │
│ 2023-01-02 20:00:00 ┆ 92   │
└─────────────────────┴──────┘