I have an imported .csv of numeric values. I want to process the dataframe so I end up with a list showing how many times each value occurs across the whole dataframe.
E.g.
1: 5
2: 0
3: 23
4: 8
I have found how to count the values of a specified column, but I can't find a way to do the same thing for the entire dataframe. I could count the values of each column and combine the results afterwards, but that is a bit clunky, and I was looking for a more elegant solution.
This is an example of what I was trying:
sort_dataframe = df.select(pl.col("1", "2", "3", "4", "5", "6", "7").value_counts())
Which results in:
polars.exceptions.ComputeError: Series length 16 doesn't match the DataFrame height of 26
The error occurs because value_counts returns one row per distinct value, so each column's result generally has a different length than the original dataframe (and than the other columns), and the results cannot be assembled into a single dataframe.
TL;DR: You can use value_counts after unpivoting the dataframe into long format.
df.unpivot().get_column("value").value_counts()
Let us consider the following example dataframe.
import polars as pl
df = pl.DataFrame({
"col_1": [1, 2, 3],
"col_2": [2, 3, 7],
"col_3": [1, 1, 9],
})
shape: (3, 3)
┌───────┬───────┬───────┐
│ col_1 ┆ col_2 ┆ col_3 │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═══════╪═══════╪═══════╡
│ 1 ┆ 2 ┆ 1 │
│ 2 ┆ 3 ┆ 1 │
│ 3 ┆ 7 ┆ 9 │
└───────┴───────┴───────┘
First, we can unpivot the dataframe using pl.DataFrame.unpivot
to obtain a single column containing all values.
df.unpivot()
shape: (9, 2)
┌──────────┬───────┐
│ variable ┆ value │
│ --- ┆ --- │
│ str ┆ i64 │
╞══════════╪═══════╡
│ col_1 ┆ 1 │
│ col_1 ┆ 2 │
│ col_1 ┆ 3 │
│ col_2 ┆ 2 │
│ col_2 ┆ 3 │
│ col_2 ┆ 7 │
│ col_3 ┆ 1 │
│ col_3 ┆ 1 │
│ col_3 ┆ 9 │
└──────────┴───────┘
Finally, we can get the value column as a pl.Series
and use pl.Series.value_counts
to count the number of occurrences of each value.
counts = df.unpivot().get_column("value").value_counts()
shape: (5, 2)
┌───────┬───────┐
│ value ┆ count │
│ --- ┆ --- │
│ i64 ┆ u32 │
╞═══════╪═══════╡
│ 7 ┆ 1 │
│ 3 ┆ 2 │
│ 9 ┆ 1 │
│ 2 ┆ 2 │
│ 1 ┆ 3 │
└───────┴───────┘
The counts dataframe can also easily be converted to a Python dictionary.
dict(counts.iter_rows())
{3: 2, 7: 1, 1: 3, 2: 2, 9: 1}