This runs on a single core, despite not using (seemingly) any non-Polars stuff. What am I doing wrong?
(the goal is to convert a list in doc_ids
field in every row into its string representation, s.t. [1, 2, 3]
(list[int]) -> '[1, 2, 3]'
(string))
import polars as pl
df = pl.DataFrame(dict(ent = ['a', 'b'], doc_ids = [[2,3], [3]]))
df = (df.lazy()
.with_columns(
pl.concat_str(
pl.lit('['),
pl.col('doc_ids').map_elements(lambda x: x.cast(pl.String)).list.join(', '),
pl.lit(']')
)
.alias('docs_str')
)
.drop('doc_ids')
).collect()
In general, we want to avoid map_elements
at all costs. It acts like a black-box function that Polars cannot optimize, leading to single-threaded performance.
Here's one way that we can eliminate map_elements
: we can .cast()
directly to pl.List(pl.String)
(
df.lazy()
.with_columns(
pl.concat_str(
pl.lit("["),
pl.col("doc_ids")
.cast(pl.List(pl.String))
.list.join(", "),
pl.lit("]"),
).alias("docs_str")
)
.drop("doc_ids")
.collect()
)
shape: (2, 2)
┌─────┬──────────┐
│ ent ┆ docs_str │
│ --- ┆ --- │
│ str ┆ str │
╞═════╪══════════╡
│ a ┆ [2, 3] │
│ b ┆ [3] │
└─────┴──────────┘