Search code examples
pythonvectorizationpython-polars

How can I avoid using pl.DataFrame.iter_rows() and instead vectorize this


I have two polars dataframes containing a unique ID and the name of a utility. I am trying to build a mapping of entries between these two dataframes. I am using polars_fuzzy_match to do a fuzzy string search against entries. My first dataframe (wg_df) is approximately a subset of the second (eia_df). In my code below I am passing each utility_name from wg_df into fuzzy_match_score run against the eia_utility_name. Can I avoid the rowise iteration and vectorize this?

import polars as pl
from polars_fuzzy_match import fuzzy_match_score

# Sample data
#  wg_df is approximately a subset of eia_df.
wg_df = pl.DataFrame({"wg_id": [1, 2], "utility_name": ["Utility A", "Utility B"]})

eia_df = pl.DataFrame(
    {"eia_id": [101, 102, 103], "utility_name": ["Utility A co.", "Utility B", "utility c"]}
)

out = pl.DataFrame(
    schema=[
        ("wg_id", pl.Int64),
        ("eia_id", pl.Int64),
        ("wg_utility_name", pl.String),
        ("utility_name", pl.String),
        ("score", pl.UInt32),
    ],
)

# Iterate through each wg utility and find the best match in eia
# can this be vectorized?
for wg_id, utility in wg_df.iter_rows():
    res = (
        eia_df.with_columns(score=fuzzy_match_score(pl.col("utility_name"), utility))
        .filter(pl.col("score").is_not_null())
        .sort(by="score", descending=True)
    )
    # insert the wg_id and wg_utility_name into the results. They have to be put into the 
    res.insert_column(
        0,
        pl.Series("wg_id", [wg_id] * len(res)),
    )
    res.insert_column(2, pl.Series("wg_utility_name", [utility] * len(res)))
    out = out.vstack(res.select([col_name for col_name in out.schema]))

Solution

  • polars-fuzzy-match would need to add support for vectorization. (i.e. at the Rust level)

    The polars_ds plugin has vectorized functions backed by the impressive RapidFuzz library.

    import polars_ds as pds
    
    (eia_df
      .lazy()
      .join(wg_df.lazy(), how="cross")
      .with_columns(
          score = pds.str_fuzz(pl.col.utility_name, pl.col.utility_name_right),
      )
      .collect()
    )
    
    shape: (6, 5)
    ┌────────┬───────────────┬───────┬────────────────────┬──────────┐
    │ eia_id ┆ utility_name  ┆ wg_id ┆ utility_name_right ┆ score    │
    │ ---    ┆ ---           ┆ ---   ┆ ---                ┆ ---      │
    │ i64    ┆ str           ┆ i64   ┆ str                ┆ f64      │
    ╞════════╪═══════════════╪═══════╪════════════════════╪══════════╡
    │ 101    ┆ Utility A co. ┆ 1     ┆ Utility A          ┆ 0.818182 │
    │ 101    ┆ Utility A co. ┆ 2     ┆ Utility B          ┆ 0.727273 │ 
    │ 102    ┆ Utility B     ┆ 1     ┆ Utility A          ┆ 0.888889 │
    │ 102    ┆ Utility B     ┆ 2     ┆ Utility B          ┆ 1.0      │
    │ 103    ┆ utility c     ┆ 1     ┆ Utility A          ┆ 0.777778 │
    │ 103    ┆ utility c     ┆ 2     ┆ Utility B          ┆ 0.777778 │
    └────────┴───────────────┴───────┴────────────────────┴──────────┘
    

    It seems polars-fuzzy-match uses the nucleo library which says it uses the Smith-Waterman algorithm.

    It seems like it only gives a scores if they are actual substrings?

    (eia_df
      .lazy()
      .join(wg_df.lazy(), how="cross")
      .with_columns(
          score = pds.str_fuzz(pl.col.utility_name, pl.col.utility_name_right),
      )
      .with_columns(
         a = 
            pl.col.utility_name
              .str.to_lowercase()
              .str.contains(
                 pl.col.utility_name_right.str.to_lowercase(), 
                 literal = True
              ),
         b = 
            pl.col.utility_name_right.str.to_lowercase()
              .str.contains(
                 pl.col.utility_name.str.to_lowercase(), 
                 literal = True
            )
      )
      .filter(pl.col.a | pl.col.b)
      .collect()
    )
    
    shape: (2, 7)
    ┌────────┬───────────────┬───────┬────────────────────┬──────────┬──────┬───────┐
    │ eia_id ┆ utility_name  ┆ wg_id ┆ utility_name_right ┆ score    ┆ a    ┆ b     │
    │ ---    ┆ ---           ┆ ---   ┆ ---                ┆ ---      ┆ ---  ┆ ---   │
    │ i64    ┆ str           ┆ i64   ┆ str                ┆ f64      ┆ bool ┆ bool  │
    ╞════════╪═══════════════╪═══════╪════════════════════╪══════════╪══════╪═══════╡
    │ 101    ┆ Utility A co. ┆ 1     ┆ Utility A          ┆ 0.818182 ┆ true ┆ false │
    │ 102    ┆ Utility B     ┆ 2     ┆ Utility B          ┆ 1.0      ┆ true ┆ true  │
    └────────┴───────────────┴───────┴────────────────────┴──────────┴──────┴───────┘
    

    1. https://polars-ds-extension.readthedocs.io/en/latest/string.html#polars_ds.string.str_fuzz
    2. https://github.com/rapidfuzz/rapidfuzz-rs
    3. https://github.com/helix-editor/nucleo
    4. https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm
    5. https://github.com/rapidfuzz/RapidFuzz/issues/175