Tags: python, vectorization, python-polars

How can I avoid using pl.DataFrame.iter_rows() and instead vectorize this?


I have two Polars dataframes, each containing a unique ID and the name of a utility, and I am trying to build a mapping of entries between them. I am using polars_fuzzy_match to do a fuzzy string search against entries. My first dataframe (wg_df) is approximately a subset of the second (eia_df). In my code below I pass each utility_name from wg_df into fuzzy_match_score, run against the utility_name column of eia_df. Can I avoid the row-wise iteration and vectorize this?

import polars as pl
from polars_fuzzy_match import fuzzy_match_score

# Sample data
#  wg_df is approximately a subset of eia_df.
wg_df = pl.DataFrame({"wg_id": [1, 2], "utility_name": ["Utility A", "Utility B"]})

eia_df = pl.DataFrame(
    {"eia_id": [101, 102, 103], "utility_name": ["Utility A co.", "Utility B", "utility c"]}
)

out = pl.DataFrame(
    schema=[
        ("wg_id", pl.Int64),
        ("eia_id", pl.Int64),
        ("wg_utility_name", pl.String),
        ("utility_name", pl.String),
        ("score", pl.UInt32),
    ],
)

# Iterate through each wg utility and find the best match in eia
# can this be vectorized?
for wg_id, utility in wg_df.iter_rows():
    res = (
        eia_df.with_columns(score=fuzzy_match_score(pl.col("utility_name"), utility))
        .filter(pl.col("score").is_not_null())
        .sort(by="score", descending=True)
    )
    # Insert the wg_id and wg_utility_name into the results; they have to be
    # placed at the column positions that match the `out` schema.
    res.insert_column(
        0,
        pl.Series("wg_id", [wg_id] * len(res)),
    )
    res.insert_column(2, pl.Series("wg_utility_name", [utility] * len(res)))
    out = out.vstack(res.select([col_name for col_name in out.schema]))

Solution

  • Build one LazyFrame per wg_df row, concatenate them, and collect once; Polars will parallelize the work.

    def make_res(wg_id, utility, eia_df):
        return (
            eia_df.lazy()
            .select(
                pl.lit(wg_id).alias("wg_id"),
                pl.all(),
                other_utility_name=pl.lit(utility),
                score=fuzzy_match_score(pl.col("utility_name"), utility),
            )
            .filter(pl.col("score").is_not_null())
        )
        
    pl.concat([
        make_res(wg_id, utility, eia_df) for wg_id, utility in wg_df.iter_rows()
    ]).collect()
        
    
    shape: (2, 5)
    ┌───────┬────────┬───────────────┬────────────────────┬───────┐
    │ wg_id ┆ eia_id ┆ utility_name  ┆ other_utility_name ┆ score │
    │ ---   ┆ ---    ┆ ---           ┆ ---                ┆ ---   │
    │ i32   ┆ i64    ┆ str           ┆ str                ┆ u32   │
    ╞═══════╪════════╪═══════════════╪════════════════════╪═══════╡
    │ 1     ┆ 101    ┆ Utility A co. ┆ Utility A          ┆ 228   │
    │ 2     ┆ 102    ┆ Utility B     ┆ Utility B          ┆ 228   │
    └───────┴────────┴───────────────┴────────────────────┴───────┘
    

    I don't know anything about this plugin and whether it's expected that both results are 228.

    Setting that aside, since it's actually not related to the issue, there are two keys to this approach. The first is that we aren't vstacking on every iteration; the second is that when you concat a list of LazyFrames and then collect them, Polars executes each LazyFrame's plan in parallel.