I have a huge dataset (~100M rows). Since pandas does not support multi-threading, I am trying to use the Polars library for the analysis. A minimal version of the problem I am trying to solve is shown below:
import polars as pl
df = pl.DataFrame({"col1" : ["abc", "def", "ghi"], "col2": [[1, 2, 3, 4], [1, 2], [9, 10]], "col3": [[1,4], [2,2], [3,4,5]]})
df
The actual values in col2 and col3 are strings that I mapped onto integers, hoping to be able to use a NumPy ufunc that would enable multi-threaded processing, unlike the slow map_rows method.
So far, I have only been able to solve it with the slow map_rows method.
def comm(lst1, lst2):
    # Elements present in both lists, joined into a "|"-separated string
    s = set(lst1).intersection(set(lst2))
    s = [str(t) for t in s]
    return '|'.join(s)
res = df.map_rows(lambda t: (t[0], t[1], t[2], comm(t[1], t[2])))
res.columns = ['col1', 'col2', 'col3', 'common']
res
This produces the required output:
shape: (3, 4)
┌──────┬─────────────┬───────────┬────────┐
│ col1 ┆ col2        ┆ col3      ┆ common │
│ ---  ┆ ---         ┆ ---       ┆ ---    │
│ str  ┆ list[i64]   ┆ list[i64] ┆ str    │
╞══════╪═════════════╪═══════════╪════════╡
│ abc  ┆ [1, 2, … 4] ┆ [1, 4]    ┆ 1|4    │
│ def  ┆ [1, 2]      ┆ [2, 2]    ┆ 2      │
│ ghi  ┆ [9, 10]     ┆ [3, 4, 5] ┆        │
└──────┴─────────────┴───────────┴────────┘
Any help in parallelizing the code would be appreciated. TIA
Polars has a dedicated .list.set_intersection() method.
df.with_columns(
    pl.col("col2").list.set_intersection("col3")
      .cast(pl.List(pl.String))
      .list.join("|")
      .alias("common")
)
shape: (3, 4)
┌──────┬─────────────┬───────────┬────────┐
│ col1 ┆ col2        ┆ col3      ┆ common │
│ ---  ┆ ---         ┆ ---       ┆ ---    │
│ str  ┆ list[i64]   ┆ list[i64] ┆ str    │
╞══════╪═════════════╪═══════════╪════════╡
│ abc  ┆ [1, 2, … 4] ┆ [1, 4]    ┆ 1|4    │
│ def  ┆ [1, 2]      ┆ [2, 2]    ┆ 2      │
│ ghi  ┆ [9, 10]     ┆ [3, 4, 5] ┆        │
└──────┴─────────────┴───────────┴────────┘