Search code examples
python-3.xpandasvectorization

Faster Way to Categorize Unique Row Combinations based on a Condition in Python Pandas?


I have a dataset that has the same layout as the one below. I want to categorize where each person went in their car for a given month.

Date Visitor Car Name Location
6-2022 Michael Chrysler Work
6-2022 Jim Subaru Home
6-2022 Jim Subaru Work
6-2022 Michael Chrysler Karate
7-2022 Michael Chrysler Work
7-2022 Jim Subaru Home
7-2022 Jim Subaru Work

The solution would look something like this:

Date Visitor Car Name Location
6-2022 Michael Chrysler Work & Karate
6-2022 Jim Subaru Work & Home
7-2022 Michael Chrysler Work
7-2022 Jim Subaru Work & Home

I HAVE a solution but it scales linearly with time as the number of rows increases. The fastest I have gotten it to run over ~32,000 rows is ~15 minutes.


Solution

  • You could try as follows:

    • Use df.groupby on Date, Visitor, Car Name, and apply join to the values of Location.
    • To change the last , delimiter to an ampersand, you could use Series.replace with a regex pattern.
    res = df.groupby(['Date','Visitor','Car Name'], 
                     as_index=False, sort=False)['Location'].agg(', '.join)
    
    # changing each last `,` delimiter into an `&`
    res['Location'] = res['Location'].replace(r',\s(?=[^,]*$)',' & ', regex=True)
    
    print(res)
    
         Date  Visitor  Car Name       Location
    0  6-2022  Michael  Chrysler  Work & Karate
    1  6-2022      Jim    Subaru    Home & Work
    2  7-2022  Michael  Chrysler           Work
    3  7-2022      Jim    Subaru    Home & Work
    

    If you expect duplicates per group (and want to get rid of them), you could use .agg(lambda x: ', '.join(set(x))) instead of .agg(', '.join). Mind you, this will affect the order, not sure if this is a problem. If so, there are of course also ways to get rid of duplicates in a list, while maintained order.