Search code examples
pythonpandasaggregate

Reduce pandas data frame to have one column with list of repeating values


I have the following dataframe:

index   name     path
0       Dina     "gs://my_bucket/folder1/img1.png"
1       Dina     "gs://my_bucket/folder1/img2.png"
2       Lane     "gs://my_bucket/folder1/img3.png"
3       Bari     "gs://my_bucket/folder1/img4.png"
4       Andrew   "gs://my_bucket/folder1/img5.png"
5       Andrew   "gs://my_bucket/folder1/img6.png"
6       Andrew   "gs://my_bucket/folder1/img7.png"
7       Beti     "gs://my_bucket/folder1/img7.png"
8       Ladin    "gs://my_bucket/folder1/img5.png"
...

I would like to get new dataframe which will have the unique names appears only once, and the path column will be list with the matching paths. The output should look like this:

index   name     path
0       Dina     ["gs://my_bucket/folder1/img1.png","gs://my_bucket/folder1/img2.png"]
1       Lane     ["gs://my_bucket/folder1/img3.png"]
2       Bari     ["gs://my_bucket/folder1/img4.png"]
3       Andrew   ["gs://my_bucket/folder1/img5.png","gs://my_bucket/folder1/img6.png","gs://my_bucket/folder1/img7.png"]
4       Beti     ["gs://my_bucket/folder1/img7.png"]
5       Ladin    ["gs://my_bucket/folder1/img5.png"]
...

The result should have number of rows equal to unique names in the dataframe. At the moment I'm using something I did with chatgpt, but it used function that I don't understand why is it used and also it duplicates the names of the rows, so if I know I suppose to have 842 unique names, I get 992 ...

This is chatGPT solution:

# Define a custom aggregation function to combine links as a list
def combine_links(links):
    return list(set(links))  # Convert links to a list and remove duplicates

# Group the GeoDataFrame by 'name' and 'dili' and aggregate the 'link' column
result = df.groupby(['name'))['path'].agg(combine_links).reset_index()

My goal is to find a solution the gives me in the end the right number of rows, which is number of unique names.


Solution

  • A possible solution:

    df.groupby('name')['path'].agg(list).reset_index()
    

    Output:

         name                                               path
    0  Andrew  [gs://my_bucket/folder1/img5.png, gs://my_buck...
    1    Bari                  [gs://my_bucket/folder1/img4.png]
    2    Beti                  [gs://my_bucket/folder1/img7.png]
    3    Dina  [gs://my_bucket/folder1/img1.png, gs://my_buck...
    4   Ladin                  [gs://my_bucket/folder1/img5.png]
    5    Lane                  [gs://my_bucket/folder1/img3.png]