Search code examples
pythonpython-polars

How count occurrences *across all columns* in a polars dataframe?


I have an imported .csv of number values- I want to sort the dataframe so I end up a list showing how many times each value occurs across the whole dataframe.

E.g.

1: 5
2: 0
3: 23
4: 8

I have found how to count the values of a specified column, but I can't find a way to do the same thing for the entire dataframe- I could count the values of each column and then combine them afterward but it is a bit clunky and I was looking for a more elegant solution.

This is an example of what I was trying:

sort_dataframe = df.select(pl.col("1", "2", "3", "4", "5", "6", "7").value_counts())

Which results in:

polars.exceptions.ComputeError: Series length 16 doesn't match the DataFrame height of 26

Solution

  • TLDR. You can use value_counts after unpivoting the dataframe into a long format.

    df.melt().get_column("value").value_counts()
    

    Explanation

    Let us consider the following example dataframe.

    import polars as pl
    
    df = pl.DataFrame({
        "col_1": [1, 2, 3],
        "col_2": [2, 3, 7],
        "col_3": [1, 1, 9],
    })
    
    shape: (3, 3)
    ┌───────┬───────┬───────┐
    │ col_1 ┆ col_2 ┆ col_3 │
    │ ---   ┆ ---   ┆ ---   │
    │ i64   ┆ i64   ┆ i64   │
    ╞═══════╪═══════╪═══════╡
    │ 1     ┆ 2     ┆ 1     │
    │ 2     ┆ 3     ┆ 1     │
    │ 3     ┆ 7     ┆ 9     │
    └───────┴───────┴───────┘
    

    First, we can unpivot the dataframe using pl.DataFrame.melt to obtain a single column containing all values.

    df.melt()
    
    shape: (9, 2)
    ┌──────────┬───────┐
    │ variable ┆ value │
    │ ---      ┆ ---   │
    │ str      ┆ i64   │
    ╞══════════╪═══════╡
    │ col_1    ┆ 1     │
    │ col_1    ┆ 2     │
    │ col_1    ┆ 3     │
    │ col_2    ┆ 2     │
    │ col_2    ┆ 3     │
    │ col_2    ┆ 7     │
    │ col_3    ┆ 1     │
    │ col_3    ┆ 1     │
    │ col_3    ┆ 9     │
    └──────────┴───────┘
    

    Finally, we can get the value column as a pl.Series and use pl.Series.value_counts to count the number of occurrences of each value.

    counts = df.melt().get_column("value").value_counts()
    
    shape: (5, 2)
    ┌───────┬───────┐
    │ value ┆ count │
    │ ---   ┆ ---   │
    │ i64   ┆ u32   │
    ╞═══════╪═══════╡
    │ 7     ┆ 1     │
    │ 3     ┆ 2     │
    │ 9     ┆ 1     │
    │ 2     ┆ 2     │
    │ 1     ┆ 3     │
    └───────┴───────┘
    

    This can also be simply converted to a python dictionary.

    dict(counts.iter_rows())
    
    {3: 2, 7: 1, 1: 3, 2: 2, 9: 1}