Search code examples
pythonpandas

Create a new column of dictionaries where keys are in another column of lists, and values are found by looking up the keys in another column


I'm trying to create the "Related Quantities" column of a dataframe given the existing "Item", "Quantity", and "Related Items" columns.

Item Quantity Related Items Related Quantities
0 Flowers 1 ['Bushes'] {'Bushes': 2}
1 Bushes 2 ['Flowers'] {'Flowers': 1}
2 Cars 3 ['Trucks', 'Motorcycles'] {'Trucks': 4, 'Motorcycles': 5}
3 Trucks 4 ['Cars', 'Motorcycles'] {'Cars': 3, 'Motorcycles': 5}
4 Motorcycles 5 ['Cars', 'Trucks'] {'Cars': 3, 'Trucks': 4}

The values in the dictionaries will be used as the input of a function to generate another column later


I believe I have a line that will make the dictionary for a single row, which I could use to fill out the column using iterrows() and a for loop:

dictionary = {item : df.loc[df['Item'] == item,['Quantity']][idx] for item in related_items_list}

(where idx is something to grab the corresponding index of the one row left after filtering by 'Item', and related_items_list is the value grabbed from the 'Related Items' column of the current row in the loop)

But I'm trying to make something with df.apply() instead, in hopes that it will be more performant. Is there some way to allow the function called in apply() to access the whole dataframe instead of just the row passed to it?

I think I may be way overcomplicating this. Is trying to use apply() for performance a waste of time? Should I just be using a loop to feed the 'Quantity' data into the function directly instead of making this column at all? I worry that will also hurt performance

EDIT

Thank you. It looks like to_dict() is somewhat faster than making the lookup dict using a for loop.

number of rows in df:
704400

Testing with 100 iterations:
time of 'for item,qty in zip of two columns' method:
19.455784299999998

time of 'df.set_index(col1)[col2].to_dict()' method:
11.409341199999997

Solution

  • Using df.apply with a lookup dictionary.

    import pandas as pd
    
    data = {'Item': ['Flowers', 'Bushes', 'Cars', 'Trucks', 'Motorcycles'],
            'Quantity': [1, 2, 3, 4, 5],
            'Related Items': [['Bushes'], ['Flowers'], ['Trucks', 'Motorcycles'], ['Cars', 'Motorcycles'], ['Cars', 'Trucks']]}
    df = pd.DataFrame(data)
    
    
    # Creates a dictionary for fast quantity lookups
    item_quantities = df.set_index('Item')['Quantity'].to_dict()
    
    def create_related_quantities(row):
        related_quantities = {}
        for item in row['Related Items']:
            quantity = item_quantities.get(item)  # Get quantity from lookup dictionary or None if not found
            if quantity is not None:  # Only add to dict if quantity exists
                related_quantities[item] = quantity
        return related_quantities
    
    # Applies the create_related_quantities function to each row (axis=1) of the DataFrame
    df['Related Quantities'] = df.apply(create_related_quantities, axis=1)
    
    print(df)
    

    The new column created

              Item  ...               Related Quantities
    0      Flowers  ...                    {'Bushes': 2}
    1       Bushes  ...                   {'Flowers': 1}
    2         Cars  ...  {'Trucks': 4, 'Motorcycles': 5}
    3       Trucks  ...    {'Cars': 3, 'Motorcycles': 5}
    4  Motorcycles  ...         {'Cars': 3, 'Trucks': 4}