Search code examples
pythonpython-polars

How to get max value record per a group in polars dataframe?


I have a dataframe as:

pl.DataFrame(
[{'row_nr': 0, 'middle_name_or_initial': 'R.', 'count': 8},
 {'row_nr': 0, 'middle_name_or_initial': 'B.', 'count': 1},
 {'row_nr': 1, 'middle_name_or_initial': 'D.', 'count': 1},
 {'row_nr': 1, 'middle_name_or_initial': 'J.', 'count': 11},
 {'row_nr': 2, 'middle_name_or_initial': 'Micha', 'count': 1},
 {'row_nr': 2, 'middle_name_or_initial': 'M.', 'count': 1},
 {'row_nr': 2, 'middle_name_or_initial': 'J.', 'count': 1},
 {'row_nr': 3, 'middle_name_or_initial': 'E.', 'count': 1},
 {'row_nr': 3, 'middle_name_or_initial': 'S.', 'count': 1},
 {'row_nr': 4, 'middle_name_or_initial': 'M.', 'count': 1},
 {'row_nr': 4, 'middle_name_or_initial': 'P.', 'count': 1},
 {'row_nr': 5, 'middle_name_or_initial': 'Christopher', 'count': 15},
 {'row_nr': 5, 'middle_name_or_initial': 'Robert', 'count': 1},
 {'row_nr': 6, 'middle_name_or_initial': 'Lusi', 'count': 1},
 {'row_nr': 6, 'middle_name_or_initial': 'Luis', 'count': 1}])

Here I would like to select the observations on a condition as: on grouping row_nr and middle_name the highest count observation to be given as output per a group.

if the counts are equal i.e 1 all of these group rows will be returned.

expected output as

enter image description here


Solution

  • Here you go. It would have been nice if you had a clearer example, with fewer rows and more cases of count != 1, as well as the expected solution as code rather than screenshot so the responder can test their solution (using test_df.equals(solution_df)). See below for a suggestion for how to create this.

    I also think you could have phrased your question better: You said

    on grouping row_nr and middle_name the highest count observation to be given as output per a group

    but your suggested solution implies that you want to group on row_nr and select the rows containing the highest count. "grouping row_nr and middle_name" is what I believe set @Will on his track.

    import polars as pl
    
    pl.Config.set_tbl_rows(100) # set max print row-number high enough
    
    tmp = pl.DataFrame(
        [
            {"row_nr": 0, "middle_name_or_initial": "R.", "count": 8},
            {"row_nr": 0, "middle_name_or_initial": "B.", "count": 1},
            {"row_nr": 1, "middle_name_or_initial": "D.", "count": 1},
            {"row_nr": 1, "middle_name_or_initial": "J.", "count": 11},
            {"row_nr": 2, "middle_name_or_initial": "Micha", "count": 1},
            {"row_nr": 2, "middle_name_or_initial": "M.", "count": 1},
            {"row_nr": 2, "middle_name_or_initial": "J.", "count": 1},
            {"row_nr": 3, "middle_name_or_initial": "E.", "count": 1},
            {"row_nr": 3, "middle_name_or_initial": "S.", "count": 1},
            {"row_nr": 4, "middle_name_or_initial": "M.", "count": 1},
            {"row_nr": 4, "middle_name_or_initial": "P.", "count": 1},
            {"row_nr": 5, "middle_name_or_initial": "Christopher", "count": 15},
            {"row_nr": 5, "middle_name_or_initial": "Robert", "count": 1},
            {"row_nr": 6, "middle_name_or_initial": "Lusi", "count": 1},
            {"row_nr": 6, "middle_name_or_initial": "Luis", "count": 1},
        ]
    )
    
    df = (
        tmp.with_columns(pl.col("count").max().over("row_nr").alias("row_max"))
        .filter(pl.col("count") == pl.col("row_max"))
        .select("count", "middle_name_or_initial", "row_nr")
    )
    df
    
    ------
    
    shape: (12, 3)
    ┌───────┬────────────────────────┬────────┐
    │ count ┆ middle_name_or_initial ┆ row_nr │
    │ ---   ┆ ---                    ┆ ---    │
    │ i64   ┆ str                    ┆ i64    │
    ╞═══════╪════════════════════════╪════════╡
    │ 8     ┆ R.                     ┆ 0      │
    │ 11    ┆ J.                     ┆ 1      │
    │ 1     ┆ Micha                  ┆ 2      │
    │ 1     ┆ M.                     ┆ 2      │
    │ 1     ┆ J.                     ┆ 2      │
    │ 1     ┆ E.                     ┆ 3      │
    │ 1     ┆ S.                     ┆ 3      │
    │ 1     ┆ M.                     ┆ 4      │
    │ 1     ┆ P.                     ┆ 4      │
    │ 15    ┆ Christopher            ┆ 5      │
    │ 1     ┆ Lusi                   ┆ 6      │
    │ 1     ┆ Luis                   ┆ 6      │
    └───────┴────────────────────────┴────────┘
    

    Here is a way of creating a test dataframe code from the result above:

    df.to_dict(as_series=False)
    ---
    {'count': [8, 11, 1, 1, 1, 1, 1, 1, 1, 15, 1, 1],
     'middle_name_or_initial': ['R.',
      'J.',
      'Micha',
      'M.',
      'J.',
      'E.',
      'S.',
      'M.',
      'P.',
      'Christopher',
      'Lusi',
      'Luis'],
     'row_nr': [0, 1, 2, 2, 2, 3, 3, 4, 4, 5, 6, 6]}
    

    This can then be written as a test case in your question (in case you have future questions):

    df_test = pl.DataFrame(
        {
            "count": [8, 11, 1, 1, 1, 1, 1, 1, 1, 15, 1, 1],
            "middle_name_or_initial": [
                "R.",
                "J.",
                "Micha",
                "M.",
                "J.",
                "E.",
                "S.",
                "M.",
                "P.",
                "Christopher",
                "Lusi",
                "Luis",
            ],
            "row_nr": [0, 1, 2, 2, 2, 3, 3, 4, 4, 5, 6, 6],
        }
    )