The code below creates a column called paid that looks like a list but has dtype object, which makes it practically useless as a column. How can I ensure that the created column is a list column rather than an object column, given that .cast() cannot be applied to the object column after it has been created?
import numpy as np
import polars as pl
import scipy.stats as stats

CLUSTERS = 200
MEAN_TRIALS = 20
MU = 0.5
SIGMA = 0.1

df_cluster = pl.DataFrame({'cluster_id': range(1, CLUSTERS+1)})
df_cluster = df_cluster.with_columns(
    mu = stats.truncnorm(a=0, b=1, loc=MU, scale=SIGMA).rvs(size=CLUSTERS),
    trials = np.random.poisson(lam=MEAN_TRIALS, size=CLUSTERS)
)
df_cluster = df_cluster.with_columns(
    pl.struct(["mu", "trials"])
    .map_elements(lambda x: np.random.binomial(n=1, p=x['mu'], size=x['trials']))
    .alias('paid')
)
df_cluster.head()
The lambda returns a numpy ndarray, which is what ends up as an object column. You can convert the ndarray to a list() before returning it:
...
df_cluster = df_cluster.with_columns(
    pl.struct(["mu", "trials"])
    .map_elements(
        lambda x: list(np.random.binomial(n=1, p=x['mu'], size=x['trials']))
    )
    .alias('paid')
)
df_cluster.head()
┌────────────┬──────────┬────────┬─────────────┐
│ cluster_id ┆ mu       ┆ trials ┆ paid        │
│ ---        ┆ ---      ┆ ---    ┆ ---         │
│ i64        ┆ f64      ┆ i32    ┆ list[i32]   │
╞════════════╪══════════╪════════╪═════════════╡
│ 1          ┆ 0.508726 ┆ 25     ┆ [1, 0, … 1] │
│ 2          ┆ 0.513275 ┆ 26     ┆ [1, 1, … 0] │
│ 3          ┆ 0.57244  ┆ 22     ┆ [1, 0, … 1] │
│ 4          ┆ 0.556384 ┆ 15     ┆ [0, 0, … 0] │
│ 5          ┆ 0.51955  ┆ 15     ┆ [1, 1, … 1] │
└────────────┴──────────┴────────┴─────────────┘