Say I have
df = pl.DataFrame({
'a': [1, 2, 1],
'b': [2, 1, 2],
'c': [3, 3, 2],
'column': ['a', 'c', 'b'],
})
shape: (3, 4)
┌─────┬─────┬─────┬────────┐
│ a ┆ b ┆ c ┆ column │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════╪════════╡
│ 1 ┆ 2 ┆ 3 ┆ a │
│ 2 ┆ 1 ┆ 3 ┆ c │
│ 1 ┆ 2 ┆ 2 ┆ b │
└─────┴─────┴─────┴────────┘
I want add a column which, for each row, take the value in the row corresponding to the column in column
.
Expected output:
shape: (3, 5)
┌─────┬─────┬─────┬────────┬────────┐
│ a ┆ b ┆ c ┆ column ┆ lookup │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ str ┆ i64 │
╞═════╪═════╪═════╪════════╪════════╡
│ 1 ┆ 2 ┆ 3 ┆ a ┆ 1 │
│ 2 ┆ 1 ┆ 3 ┆ c ┆ 3 │
│ 1 ┆ 2 ┆ 2 ┆ b ┆ 2 │
└─────┴─────┴─────┴────────┴────────┘
This works even with repeated columns
df = pl.DataFrame({
'a': [1, 2, 1,3],
'b': [2, 1, 2,4],
'c': [3, 3, 2,5],
'column': ['a', 'c', 'b','a'],
})
df.with_row_count('i').join(
df.with_row_count('i')
.melt(['i','column'], value_name='lookup')
.filter(pl.col('column')==pl.col('variable'))
.select('i','lookup'),
on='i', how='left').sort('i').drop('i')
shape: (4, 5)
┌─────┬─────┬─────┬────────┬────────┐
│ a ┆ b ┆ c ┆ column ┆ lookup │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ str ┆ i64 │
╞═════╪═════╪═════╪════════╪════════╡
│ 1 ┆ 2 ┆ 3 ┆ a ┆ 1 │
│ 2 ┆ 1 ┆ 3 ┆ c ┆ 3 │
│ 1 ┆ 2 ┆ 2 ┆ b ┆ 2 │
│ 3 ┆ 4 ┆ 5 ┆ a ┆ 3 │
└─────┴─────┴─────┴────────┴────────┘
Here's another way that doesn't use melt
df.join(
df.select(pl.col('column').unique().sort()).with_row_count('i'),
on='column'
).with_columns(
lookup=(pl.concat_list('a','b','c').list.get(pl.col('i')))
).drop('i')
It works by converting the 'a','b','c' columns into a list column and then using .get
to extract the index of that list since get will take an expression. For that to work we first have to make a lookup table of what index value abc are. It obviously breaks badly if the existing columns don't perfectly match what's in column
. Assuming your data is scrubbed in advance then I think this would be faster than the melt.