I have a Polars DataFrame with a bunch of columns, and I need to find the column with, for example, the largest sum.
The snippet below sums all of the columns:
import polars as pl

df = pl.DataFrame(
{
"a": [0, 1, 3, 4],
"b": [0, 0, 0, 0],
"c": [1, 0, 1, 0],
}
)
max_col = df.select(pl.col(df.columns).sum())
shape: (1, 3)
┌─────┬─────┬─────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 8 ┆ 0 ┆ 2 │
└─────┴─────┴─────┘
But I'm missing the last step: how do I select the column with the largest value?
I would do this as an unpivot/filter.
df \
.select(pl.all().sum()) \
.unpivot() \
.filter(pl.col('value')==pl.col('value').max())
If you want the result in the original wide shape (the one-row sums frame, but with only the winning column or columns), a single chain is a bit tougher. I'd just do it in two steps instead.
allcalc=df \
.select(pl.all().sum())
allcalc.select(allcalc.unpivot().filter(pl.col('value')==pl.col('value').max()) \
.get_column('variable').to_list())
Both approaches also work if there is a tie. For instance, if you have:
df=pl.DataFrame(
{
"a": [0, 1, 3, 4],
"b": [0, 0, 0, 0],
"c": [1, 0, 1, 6],
}
)
then you'll get both 'a' and 'c' in either case, since each sums to 8.