Search code examples
python-polars

Polars - how to parallelize lambda that uses only Polars expressions?


This runs on a single core, despite not using (seemingly) any non-Polars stuff. What am I doing wrong?

(the goal is to convert a list in doc_ids field in every row into its string representation, s.t. [1, 2, 3] (list[int]) -> '[1, 2, 3]' (string))

import polars as pl


df = pl.DataFrame(dict(ent = ['a', 'b'], doc_ids = [[2,3], [3]]))
df = (df.lazy()
    .with_columns(
        pl.concat_str(
            pl.lit('['),
            pl.col('doc_ids').map_elements(lambda x: x.cast(pl.String)).list.join(', '),
            pl.lit(']')
        )
        .alias('docs_str')
    )
    .drop('doc_ids')
).collect()

Solution

  • In general, we want to avoid map_elements at all costs. It acts like a black-box function that Polars cannot optimize, leading to single-threaded performance.

    Here's one way that we can eliminate map_elements: we can .cast() directly to pl.List(pl.String)

    (
        df.lazy()
        .with_columns(
            pl.concat_str(
                pl.lit("["),
                pl.col("doc_ids")
                  .cast(pl.List(pl.String))
                  .list.join(", "),
                pl.lit("]"),
            ).alias("docs_str")
        )
        .drop("doc_ids")
        .collect()
    )
    
    shape: (2, 2)
    ┌─────┬──────────┐
    │ ent ┆ docs_str │
    │ --- ┆ ---      │
    │ str ┆ str      │
    ╞═════╪══════════╡
    │ a   ┆ [2, 3]   │
    │ b   ┆ [3]      │
    └─────┴──────────┘