I have a polars dataframe as below:
Example input:
import polars as pl
df = pl.select(user_id=1, items=[1, 2, 3, 4], popular_items=[3, 4, 5, 6])
┌─────────┬──────────────┬───────────────┐
│ user_id ┆ items        ┆ popular_items │
│ ---     ┆ ---          ┆ ---           │
│ i64     ┆ list[i64]    ┆ list[i64]     │
╞═════════╪══════════════╪═══════════════╡
│ 1       ┆ [1, 2, 3, 4] ┆ [3, 4, 5, 6]  │
└─────────┴──────────────┴───────────────┘
I want to filter the popular_items column by removing any items that are in the items column, for each user_id.
I have been trying to get this to work but have run into various issues. I am probably overcomplicating things.
The expected output should be as follows:
┌─────────┬──────────────┬───────────────┬───────────┐
│ user_id ┆ items        ┆ popular_items ┆ suggested │
│ ---     ┆ ---          ┆ ---           ┆ ---       │
│ i64     ┆ list[i64]    ┆ list[i64]     ┆ list[i64] │
╞═════════╪══════════════╪═══════════════╪═══════════╡
│ 1       ┆ [1, 2, 3, 4] ┆ [3, 4, 5, 6]  ┆ [5, 6]    │
└─────────┴──────────────┴───────────────┴───────────┘
It seems like the solution should be simple, but it has escaped me after some time spent trying different things.
Any help would be greatly appreciated!
Update: .list.set_difference() has since been added to Polars.
df.with_columns(
    suggested = pl.col("popular_items").list.set_difference("items")
)
shape: (1, 4)
┌─────────┬──────────────┬───────────────┬───────────┐
│ user_id ┆ items        ┆ popular_items ┆ suggested │
│ ---     ┆ ---          ┆ ---           ┆ ---       │
│ i32     ┆ list[i64]    ┆ list[i64]     ┆ list[i64] │
╞═════════╪══════════════╪═══════════════╪═══════════╡
│ 1       ┆ [1, 2, 3, 4] ┆ [3, 4, 5, 6]  ┆ [6, 5]    │
└─────────┴──────────────┴───────────────┴───────────┘