python, python-polars

How to create a polars column listing duplicates of another column


I have had a hard time searching for the answer to this, as I find it hard to put into words.

I'm trying to aggregate multiple listings of files on disks, some of which have the same files. I want only one row for a given file, and a separate column with the disks that file may be on.

Say I have the following DataFrame:

import polars as pl

recordings = pl.DataFrame(
    {
        "disk": ["NT23", "NT24", "NT23", "NT24"],
        "path_on_disk": ["file_a.txt", "file_a.txt", "file_b.txt", "file_b.txt"],
        "other_data": [2.0, 2.0, 3.0, 3.0],
    }
)

Which looks like this:

┌──────┬──────────────┬────────────┐
│ disk ┆ path_on_disk ┆ other_data │
│ ---  ┆ ---          ┆ ---        │
│ str  ┆ str          ┆ f64        │
╞══════╪══════════════╪════════════╡
│ NT23 ┆ file_a.txt   ┆ 2.0        │
│ NT24 ┆ file_a.txt   ┆ 2.0        │
│ NT23 ┆ file_b.txt   ┆ 3.0        │
│ NT24 ┆ file_b.txt   ┆ 3.0        │
└──────┴──────────────┴────────────┘
  • Actual df has > 500k rows
  • Actual df has more columns, but if path_on_disk matches in two or more rows, we assume that the rest of the fields in those rows also match, except for disk.

I want to:

  1. Find all rows where path_on_disk is the same
  2. Make a new column disks containing the different values of disk joined together with ", ".join()

Something like this:

┌──────────────┬────────────┬────────────┐
│ path_on_disk ┆ disks      ┆ other_data │
│ ---          ┆ ---        ┆ ---        │
│ str          ┆ str        ┆ f64        │
╞══════════════╪════════════╪════════════╡
│ file_a.txt   ┆ NT23, NT24 ┆ 2.0        │
│ file_b.txt   ┆ NT23, NT24 ┆ 3.0        │
└──────────────┴────────────┴────────────┘

I've figured out that I can use recordings.group_by("path_on_disk") to accomplish the first objective:

# Recent polars versions yield (key, DataFrame) tuples when iterating a GroupBy
for _key, group_df in recordings.group_by("path_on_disk"):
    if len(group_df) > 1:
        print(group_df)
        break

This shows the first group found where path_on_disk is shared by more than one row.
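
As an aside, the same check can be done without a Python loop. A minimal sketch using a window expression (pl.len() here counts the rows in each path_on_disk group):

# Keep only rows whose path_on_disk occurs more than once
recordings.filter(pl.len().over("path_on_disk") > 1)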

I tried the following, but got an error:

def merge_disknames(df: pl.DataFrame): 
    return ", ".join(sorted(df["disk"]))

recordings.group_by("path_on_disk").map_groups(merge_disknames).rename("disks")
PanicException: Could not get DataFrame attribute '_df'. Make sure that you return a DataFrame object.: PyErr { type: <class 'AttributeError'>, value: AttributeError("'str' object has no attribute '_df'"), traceback: None }
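
Reading the error, map_groups apparently expects the function to return a DataFrame rather than a plain string. A sketch that avoids the exception (keeping my column names, and assuming one representative row per group is acceptable):

def merge_disknames(df: pl.DataFrame) -> pl.DataFrame:
    # map_groups requires a DataFrame return value, not a str
    return df.select(
        pl.col("path_on_disk").first(),
        disks=pl.lit(", ".join(sorted(df["disk"].unique()))),
    )

recordings.group_by("path_on_disk").map_groups(merge_disknames)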

Solution

  • The most polars-esque way to do this would be, as you say, to do a group_by. From there, the common operation after a group_by is an aggregation. Here, taking the unique instances of the disk column per group gives us a list column, on which we can then use the list.join method to finish the operation off.

    All together:

    (recordings.group_by("path_on_disk")
        .agg(disks=pl.col("disk").unique())
        .with_columns(pl.col("disks").list.join(", "))
    )
    
    shape: (2, 2)
    ┌──────────────┬────────────┐
    │ path_on_disk ┆ disks      │
    │ ---          ┆ ---        │
    │ str          ┆ str        │
    ╞══════════════╪════════════╡
    │ file_a.txt   ┆ NT23, NT24 │
    │ file_b.txt   ┆ NT23, NT24 │
    └──────────────┴────────────┘
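
    Your desired output also keeps other_data. Since you say the remaining fields match within a group, any representative value will do; one sketch (assuming first() is an acceptable choice) takes the first value per group:

    (recordings.group_by("path_on_disk")
        .agg(
            disks=pl.col("disk").unique(),
            other_data=pl.col("other_data").first(),
        )
        .with_columns(pl.col("disks").list.join(", "))
    )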
    

    (Note that, because of the way group_by works, the order of disks is essentially random; you can sort the list, or use maintain_order, if desired.)
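
    For example, a deterministic variant might sort the list before joining (maintain_order=True on the group_by additionally preserves the input order of the groups; both are sketches of the same idea):

    (recordings.group_by("path_on_disk", maintain_order=True)
        .agg(disks=pl.col("disk").unique().sort())
        .with_columns(pl.col("disks").list.join(", "))
    )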