Search code examples
pythonpython-polars

How to select the column from a Polars Dataframe that has the largest sum?


I have Polars dataframe with a bunch of columns I need to find the column with, for example, the largest sum.

The below snippet sums all of the columns:

df = pl.DataFrame(
    {
        "a": [0, 1, 3, 4],
        "b": [0, 0, 0, 0],
        "c": [1, 0, 1, 0],
    }
)

max_col = df.select(pl.col(df.columns).sum())  
shape: (1, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 8   ┆ 0   ┆ 2   │
└─────┴─────┴─────┘

But I'm missing the last step of selecting the column with the largest value?


Solution

  • I would do this as a melt/filter.

    test \
        .select(pl.all().sum()) \
        .melt() \
        .filter(pl.col('value')==pl.col('value').max())
    

    If you want the original shape then a single chain is a bit tougher. I'd just do it like this instead.

    allcalc=test \
        .select(pl.all().sum())
    allcalc.select(allcalc.melt().filter(pl.col('value')==pl.col('value').max()) \
           .get_column('variable').to_list())
    

    The above works if there is a tie, for instance if you have:

    test=pl.DataFrame(
        {
            "a": [0, 1, 3, 4],
            "b": [0, 0, 0, 0],
            "c": [1, 0, 1, 6],
        }
    )
    

    then you'll get 'a' and 'c' in either case.