Search code examples
pythonpython-polars

polars: multi-threaded computing of common elements between two columns


I have a huge dataset (~100M rows). As Pandas does not support multi-threading, I am trying to use polars library for analysis. The minimal problem I am trying to solve is shown below:

import polars as pl

df = pl.DataFrame({"col1" : ["abc", "def", "ghi"], "col2": [[1, 2, 3, 4], [1, 2], [9, 10]], "col3": [[1,4], [2,2], [3,4,5]]})
df

The actual values in col2 and col3 are strings that I mapped onto integers hoping to be able to use ufunc froms numpy that will enable multi-thread processing unlike the slow map_rows method.

So far, I could solve it only by using the slow map_rows method.

def comm(lst1, lst2):
    s = set(lst1).intersection(set(lst2))
    s = [str(t) for t in s]
    return '|'.join(s)
    
res = df.map_rows(lambda t: (t[0], t[1], t[2], comm(t[1], t[2])) )
res.columns = ['col1', 'col2', 'col3', 'commom']
res

Produces the required output:

shape: (3, 4)
┌──────┬─────────────┬───────────┬────────┐
│ col1 ┆ col2        ┆ col3      ┆ commom │
│ ---  ┆ ---         ┆ ---       ┆ ---    │
│ str  ┆ list[i64]   ┆ list[i64] ┆ str    │
╞══════╪═════════════╪═══════════╪════════╡
│ abc  ┆ [1, 2, … 4] ┆ [1, 4]    ┆ 1|4    │
│ def  ┆ [1, 2]      ┆ [2, 2]    ┆ 2      │
│ ghi  ┆ [9, 10]     ┆ [3, 4, 5] ┆        │
└──────┴─────────────┴───────────┴────────┘

Any help in parallelizing the code would be appreciated. TIA


Solution

  • Polars has a dedicated .list.set_intersection() method.

    df.with_columns(
        pl.col("col2").list.set_intersection("col3")
          .cast(pl.List(pl.String))
          .list.join("|")
          .alias("common")
    )
    
    shape: (3, 4)
    ┌──────┬─────────────┬───────────┬────────┐
    │ col1 ┆ col2        ┆ col3      ┆ common │
    │ ---  ┆ ---         ┆ ---       ┆ ---    │
    │ str  ┆ list[i64]   ┆ list[i64] ┆ str    │
    ╞══════╪═════════════╪═══════════╪════════╡
    │ abc  ┆ [1, 2, … 4] ┆ [1, 4]    ┆ 1|4    │
    │ def  ┆ [1, 2]      ┆ [2, 2]    ┆ 2      │
    │ ghi  ┆ [9, 10]     ┆ [3, 4, 5] ┆        │
    └──────┴─────────────┴───────────┴────────┘