I have a dataframe as:
pl.DataFrame(
[{'row_nr': 0, 'middle_name_or_initial': 'R.', 'count': 8},
{'row_nr': 0, 'middle_name_or_initial': 'B.', 'count': 1},
{'row_nr': 1, 'middle_name_or_initial': 'D.', 'count': 1},
{'row_nr': 1, 'middle_name_or_initial': 'J.', 'count': 11},
{'row_nr': 2, 'middle_name_or_initial': 'Micha', 'count': 1},
{'row_nr': 2, 'middle_name_or_initial': 'M.', 'count': 1},
{'row_nr': 2, 'middle_name_or_initial': 'J.', 'count': 1},
{'row_nr': 3, 'middle_name_or_initial': 'E.', 'count': 1},
{'row_nr': 3, 'middle_name_or_initial': 'S.', 'count': 1},
{'row_nr': 4, 'middle_name_or_initial': 'M.', 'count': 1},
{'row_nr': 4, 'middle_name_or_initial': 'P.', 'count': 1},
{'row_nr': 5, 'middle_name_or_initial': 'Christopher', 'count': 15},
{'row_nr': 5, 'middle_name_or_initial': 'Robert', 'count': 1},
{'row_nr': 6, 'middle_name_or_initial': 'Lusi', 'count': 1},
{'row_nr': 6, 'middle_name_or_initial': 'Luis', 'count': 1}])
Here I would like to select the observations on a condition as: on grouping row_nr and middle_name the highest count observation to be given as output per a group.
if the counts are equal i.e 1 all of these group rows will be returned.
expected output as
Here you go. It would have been nice if you had a clearer example, with fewer rows and more cases of count != 1
, as well as the expected solution as code rather than screenshot so the responder can test their solution (using test_df.equals(solution_df)
). See below for a suggestion for how to create this.
I also think you could have phrased your question better: You said
on grouping row_nr and middle_name the highest count observation to be given as output per a group
but your suggested solution implies that you want to group on row_nr and select the rows containing the highest count. "grouping row_nr and middle_name" is what I believe set @Will on his track.
import polars as pl
pl.Config.set_tbl_rows(100) # set max print row-number high enough
tmp = pl.DataFrame(
[
{"row_nr": 0, "middle_name_or_initial": "R.", "count": 8},
{"row_nr": 0, "middle_name_or_initial": "B.", "count": 1},
{"row_nr": 1, "middle_name_or_initial": "D.", "count": 1},
{"row_nr": 1, "middle_name_or_initial": "J.", "count": 11},
{"row_nr": 2, "middle_name_or_initial": "Micha", "count": 1},
{"row_nr": 2, "middle_name_or_initial": "M.", "count": 1},
{"row_nr": 2, "middle_name_or_initial": "J.", "count": 1},
{"row_nr": 3, "middle_name_or_initial": "E.", "count": 1},
{"row_nr": 3, "middle_name_or_initial": "S.", "count": 1},
{"row_nr": 4, "middle_name_or_initial": "M.", "count": 1},
{"row_nr": 4, "middle_name_or_initial": "P.", "count": 1},
{"row_nr": 5, "middle_name_or_initial": "Christopher", "count": 15},
{"row_nr": 5, "middle_name_or_initial": "Robert", "count": 1},
{"row_nr": 6, "middle_name_or_initial": "Lusi", "count": 1},
{"row_nr": 6, "middle_name_or_initial": "Luis", "count": 1},
]
)
df = (
tmp.with_columns(pl.col("count").max().over("row_nr").alias("row_max"))
.filter(pl.col("count") == pl.col("row_max"))
.select("count", "middle_name_or_initial", "row_nr")
)
df
------
shape: (12, 3)
┌───────┬────────────────────────┬────────┐
│ count ┆ middle_name_or_initial ┆ row_nr │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i64 │
╞═══════╪════════════════════════╪════════╡
│ 8 ┆ R. ┆ 0 │
│ 11 ┆ J. ┆ 1 │
│ 1 ┆ Micha ┆ 2 │
│ 1 ┆ M. ┆ 2 │
│ 1 ┆ J. ┆ 2 │
│ 1 ┆ E. ┆ 3 │
│ 1 ┆ S. ┆ 3 │
│ 1 ┆ M. ┆ 4 │
│ 1 ┆ P. ┆ 4 │
│ 15 ┆ Christopher ┆ 5 │
│ 1 ┆ Lusi ┆ 6 │
│ 1 ┆ Luis ┆ 6 │
└───────┴────────────────────────┴────────┘
Here is a way of creating a test dataframe code from the result above:
df.to_dict(as_series=False)
---
{'count': [8, 11, 1, 1, 1, 1, 1, 1, 1, 15, 1, 1],
'middle_name_or_initial': ['R.',
'J.',
'Micha',
'M.',
'J.',
'E.',
'S.',
'M.',
'P.',
'Christopher',
'Lusi',
'Luis'],
'row_nr': [0, 1, 2, 2, 2, 3, 3, 4, 4, 5, 6, 6]}
This can then be written as a test case in your question (in case you have future questions):
df_test = pl.DataFrame(
{
"count": [8, 11, 1, 1, 1, 1, 1, 1, 1, 15, 1, 1],
"middle_name_or_initial": [
"R.",
"J.",
"Micha",
"M.",
"J.",
"E.",
"S.",
"M.",
"P.",
"Christopher",
"Lusi",
"Luis",
],
"row_nr": [0, 1, 2, 2, 2, 3, 3, 4, 4, 5, 6, 6],
}
)