I have had a hard time searching for the answer to this, as I find it hard to put into words.
I'm trying to aggregate multiple listings of files on disks, some of which have the same files. I want only one row for a given file, and a separate column with the disks that file may be on.
Say I have the following DataFrame:
recordings = pl.DataFrame(
{
"disk": ["NT23", "NT24", "NT23", "NT24"],
"path_on_disk": ["file_a.txt", "file_a.txt", "file_b.txt", "file_b.txt"],
"other_data": [2.0, 2.0, 3.0, 3.0],
}
)
Which looks something like this (SO doesn't like terminal characters I guess):
┌──────┬──────────────┬────────────┐
│ disk ┆ path_on_disk ┆ other_data │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 │
╞══════╪══════════════╪════════════╡
│ NT23 ┆ file_a.txt ┆ 2.0 │
│ NT24 ┆ file_a.txt ┆ 2.0 │
│ NT23 ┆ file_b.txt ┆ 3.0 │
│ NT24 ┆ file_b.txt ┆ 3.0 │
└──────┴──────────────┴────────────┘
If path_on_disk matches in two or more rows, we will assume that the rest of the fields in those rows also match, except for disk. I want to:

- merge rows where path_on_disk is the same
- add a new disks column containing the different values of disk, joined together with ", ".join()
Something like this:
┌──────────────┬────────────┬────────────┐
│ path_on_disk ┆ disks ┆ other_data │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 │
╞══════════════╪════════════╪════════════╡
│ file_a ┆ NT23, NT24 ┆ 2.0 │
│ file_b ┆ NT23, NT24 ┆ 3.0 │
└──────────────┴────────────┴────────────┘
I've figured out that I can use recordings.group_by("path_on_disk") to accomplish the first objective:
for group_df in recordings.group_by("path_on_disk"):
    if len(group_df) > 1:
        print(group_df)
        break
This shows the first found group of rows where path_on_disk matches.
I tried the following, but got an error:
def merge_disknames(df: pl.DataFrame):
    return ", ".join(sorted(df["disk"]))

recordings.group_by("path_on_disk").map_groups(merge_disknames).rename("disks")
PanicException: Could not get DataFrame attribute '_df'. Make sure that you return a DataFrame object.: PyErr { type: <class 'AttributeError'>, value: AttributeError("'str' object has no attribute '_df'"), traceback: None }
The most polars-esque way to do this would be to, as you say, do a group_by. From there, the common operation after a group_by is an aggregation with agg. Taking the unique instances of the disk column per group gives us a list column, which we can then finish off with the list.join method.
All together:
(
    recordings.group_by("path_on_disk")
    .agg(disks=pl.col("disk").unique())
    .with_columns(pl.col("disks").list.join(", "))
)
shape: (2, 2)
┌──────────────┬────────────┐
│ path_on_disk ┆ disks │
│ --- ┆ --- │
│ str ┆ str │
╞══════════════╪════════════╡
│ file_a.txt ┆ NT23, NT24 │
│ file_b.txt ┆ NT23, NT24 │
└──────────────┴────────────┘
(Note that both the row order and the order of names within disks are essentially random with the way group_by works; you can add a sort, or pass maintain_order=True, if desired.)