I have a dataframe, where one column a
is a list, and another column b
contains a value that's in a
. I need to create the column c
which contains index of the element in b
in list a
df = pl.DataFrame({'a': [[1, 2, 3], [4, 5, 2], [6, 2, 7]], 'b': [3, 4, 2]})
print(df)
shape: (3, 2)
┌───────────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ list[i64] ┆ i64 │
╞═══════════╪═════╡
│ [1, 2, 3] ┆ 3 │
│ [4, 5, 2] ┆ 4 │
│ [6, 2, 7] ┆ 2 │
└───────────┴─────┘
so resulting dataframe looks like following
shape: (3, 3)
┌───────────┬─────┬────────────┐
│ a ┆ b ┆ a.index(b) │
│ --- ┆ --- ┆ --- │
│ list[i64] ┆ i64 ┆ i64 │
╞═══════════╪═════╪════════════╡
│ [1, 2, 3] ┆ 3 ┆ 2 │
│ [4, 5, 2] ┆ 4 ┆ 0 │
│ [6, 2, 7] ┆ 2 ┆ 1 │
└───────────┴─────┴────────────┘
all elements of a
are unique within the row, and b
is guaranteed to be in a
.
We want to make two helper columns. one is just a copy of a
and the other is the index values of a
. We then explode by each of them and then filter for when b
==acopy
. Lastly, we clean it up by dropping our acopy
(
df
.with_columns(
pl.col('a').alias('acopy'),
pl.int_ranges(pl.col('a').list.len()).alias('a.index(b)'))
.explode('acopy','a.index(b)')
.filter(pl.col('acopy')==pl.col('b'))
.drop('acopy')
)
shape: (3, 3)
┌───────────┬─────┬────────────┐
│ a ┆ b ┆ a.index(b) │
│ --- ┆ --- ┆ --- │
│ list[i64] ┆ i64 ┆ i64 │
╞═══════════╪═════╪════════════╡
│ [1, 2, 3] ┆ 3 ┆ 2 │
│ [4, 5, 2] ┆ 4 ┆ 0 │
│ [6, 2, 7] ┆ 2 ┆ 1 │
└───────────┴─────┴────────────┘
We can make this robust to dropping your guarantees about uniqueness (taking the first instance) and also allowing for the b
value to not be in a
while still returning the row with a null.
So, say we started with:
df = pl.DataFrame({'a': [[1, 2, 4], [4, 4, 2], [6, 2, 7]], 'b': [3, 4, 2]})
then we could do
(
df
.with_row_index('i')
.with_columns(
pl.col('a').alias('acopy'),
pl.int_ranges(pl.col('a').list.len()).alias('a.index(b)'))
.explode('acopy','a.index(b)')
.with_columns(
(pl.when(pl.col('acopy')==pl.col('b'))
.then(pl.col('a.index(b)'))
.otherwise(None)).alias('a.index(b)')
)
.group_by('i', maintain_order=True)
.agg(
pl.exclude('a.index(b)').first(),
pl.col('a.index(b)').min())
.drop('i', 'acopy')
)
shape: (3, 3)
┌───────────┬─────┬────────────┐
│ a ┆ b ┆ a.index(b) │
│ --- ┆ --- ┆ --- │
│ list[i64] ┆ i64 ┆ i64 │
╞═══════════╪═════╪════════════╡
│ [1, 2, 4] ┆ 3 ┆ null │
│ [4, 4, 2] ┆ 4 ┆ 0 │
│ [6, 2, 7] ┆ 2 ┆ 1 │
└───────────┴─────┴────────────┘
Essentially this trades the filter
for a when.then.otherwise
and group_by
. For the group_by
to work we have to create a row index helper column at the beginning.