I'm trying to create the "Related Quantities" column of a dataframe given the existing "Item", "Quantity", and "Related Items" columns.
Item | Quantity | Related Items | Related Quantities | |
---|---|---|---|---|
0 | Flowers | 1 | ['Bushes'] | {'Bushes': 2} |
1 | Bushes | 2 | ['Flowers'] | {'Flowers': 1} |
2 | Cars | 3 | ['Trucks', 'Motorcycles'] | {'Trucks': 4, 'Motorcycles': 5} |
3 | Trucks | 4 | ['Cars', 'Motorcycles'] | {'Cars': 3, 'Motorcycles': 5} |
4 | Motorcycles | 5 | ['Cars', 'Trucks'] | {'Cars': 3, 'Trucks': 4} |
The values in the dictionaries will be used as the input of a function to generate another column later
I believe I have a line that will make the dictionary for a single row, which I could use to fill out the column using iterrows()
and a for loop:
dictionary = {item : df.loc[df['Item'] == item,['Quantity']][idx] for item in related_items_list}
(where idx
is something to grab the corresponding index of the one row left after filtering by 'Item'
, and related_items_list
is the value grabbed from the 'Related Items'
column of the current row in the loop)
But I'm trying to make something with df.apply()
instead, in hopes that it will be more performant. Is there some way to allow the function called in apply()
to access the whole dataframe instead of just the row passed to it?
I think I may be way overcomplicating this. Is trying to use apply()
for performance a waste of time? Should I just be using a loop to feed the 'Quantity'
data into the function directly instead of making this column at all? I worry that will also hurt performance
Thank you. It looks like to_dict() is somewhat faster than making the lookup dict using a for loop.
number of rows in df:
704400
Testing with 100 iterations:
time of 'for item,qty in zip of two columns' method:
19.455784299999998
time of 'df.set_index(col1)[col2].to_dict()' method:
11.409341199999997
Using df.apply
with a lookup dictionary.
import pandas as pd
data = {'Item': ['Flowers', 'Bushes', 'Cars', 'Trucks', 'Motorcycles'],
'Quantity': [1, 2, 3, 4, 5],
'Related Items': [['Bushes'], ['Flowers'], ['Trucks', 'Motorcycles'], ['Cars', 'Motorcycles'], ['Cars', 'Trucks']]}
df = pd.DataFrame(data)
# Creates a dictionary for fast quantity lookups
item_quantities = df.set_index('Item')['Quantity'].to_dict()
def create_related_quantities(row):
related_quantities = {}
for item in row['Related Items']:
quantity = item_quantities.get(item) # Get quantity from lookup dictionary or None if not found
if quantity is not None: # Only add to dict if quantity exists
related_quantities[item] = quantity
return related_quantities
# Applies the create_related_quantities function to each row (axis=1) of the DataFrame
df['Related Quantities'] = df.apply(create_related_quantities, axis=1)
print(df)
The new column created
Item ... Related Quantities
0 Flowers ... {'Bushes': 2}
1 Bushes ... {'Flowers': 1}
2 Cars ... {'Trucks': 4, 'Motorcycles': 5}
3 Trucks ... {'Cars': 3, 'Motorcycles': 5}
4 Motorcycles ... {'Cars': 3, 'Trucks': 4}