I’ve hit my first real data engineering problem, and I’m looking for the fastest way to perform an update. I have a very large dataframe (15 columns, over 20M rows). Each row contains, among others, the columns shown below.
I'm looking for the most efficient way to change the 'developer' and 'owner' field values, based on whether a certain list item appears in the app_id field. In fact, I have more fields that I'd like to change, but maybe I'm better off merging them. Here is an example of code that does what I intend to do:
import pandas as pd
# Sample dataframe
data = {
    'app_id': [
        ['app_id_1', 'app_id_2'],
        ['app_id_1', 'app_id_2', 'app_id_5'],
        ['app_id_3', 'app_id_4'],
        ['app_id_3', 'app_id_4'],
        ['app_id_1', 'app_id_2'],
    ],
    'developer': ['Developer 1'] * 5,
    'owner': ['Owner 1'] * 5,
}
df = pd.DataFrame(data)

# Blow the df up to 1 million rows to benchmark timing
df = pd.concat([df] * 200000, ignore_index=True)
# Lists of apps that I want to change the 'developer' column for
apps_dev2 = ['app_id_1', 'app_id_2']
apps_dev3 = ['app_id_3']

# Dictionary that contains the changes I want to make. For simplicity,
# this example has just two entries and two attributes each.
dict_apps_changes = {
    'Dev2': {
        'developer': 'Developer 2',
        'apps': apps_dev2,
        'owner': 'Owner 2',
    },
    'Dev3': {
        'developer': 'Developer 3',
        'apps': apps_dev3,
        'owner': 'Owner 3',
    },
}
I tried this:
df_updated = df.copy()
for key, value in dict_apps_changes.items():
    mask = df_updated['app_id'].apply(lambda x: any(item in x for item in value['apps']))
    df_updated.loc[mask, 'developer'] = value['developer']
    df_updated.loc[mask, 'owner'] = value['owner']
df_updated
This takes more and more time as the dictionary grows, of course, and also as more fields are added.
Does anybody know a way to make it even faster? Note: the df is pulled from a BigQuery table, so I'm happy to embed SQL in my Python code if that's the way to go.
I can see you’re using a row-by-row iteration approach, which gets quite inefficient as the df grows. You can do a couple of things to optimize it: flatten dict_apps_changes into a per-app lookup, then make a single pass over the dataframe. Here’s an updated snippet you can try:
# Flatten dict_apps_changes for efficient per-app lookups
flattened_dict_app_changes = {}
for dev_key, dev_info in dict_apps_changes.items():
    for app in dev_info['apps']:
        flattened_dict_app_changes[app] = {
            'developer': dev_info['developer'],
            'owner': dev_info['owner'],
        }

# Return the new attributes for the first matching app id in a row,
# or the row's current values if nothing matches
def apply_changes(row):
    for app_id in row['app_id']:
        if app_id in flattened_dict_app_changes:
            return flattened_dict_app_changes[app_id]
    return {'developer': row['developer'], 'owner': row['owner']}

# Single pass over the rows, then update the original df in place
changes = df.apply(apply_changes, axis=1).apply(pd.Series)
df.update(changes)
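If a row-wise `apply` is still too slow at 20M rows, here's a sketch of an alternative that avoids the Python-level loop over rows entirely: explode the list column, map each app id through the flat lookup, and keep the first match per original row (the same first-match semantics as `apply_changes`). The `lookup` dict below is a hypothetical stand-in for what the flattening step above produces on the sample data.

```python
import pandas as pd

# Hypothetical flat app -> new-attributes lookup (what the flattening
# step above produces for the sample data)
lookup = {
    'app_id_1': {'developer': 'Developer 2', 'owner': 'Owner 2'},
    'app_id_2': {'developer': 'Developer 2', 'owner': 'Owner 2'},
    'app_id_3': {'developer': 'Developer 3', 'owner': 'Owner 3'},
}

df = pd.DataFrame({
    'app_id': [['app_id_1', 'app_id_2'], ['app_id_3', 'app_id_4'], ['app_id_9']],
    'developer': ['Developer 1'] * 3,
    'owner': ['Owner 1'] * 3,
})

# One row per (original row, app id); the index keeps the original row label
exploded = df['app_id'].explode()

# Map each app id to its new attributes; unmatched ids become NaN and are dropped
matched = exploded.map(lookup).dropna()

# Keep the first match per original row (same semantics as apply_changes)
first_match = matched[~matched.index.duplicated(keep='first')]

# Expand the dicts into columns and update the original df in place
df.update(first_match.apply(pd.Series))
```

Note that `explode` temporarily materialises one row per list element, so watch memory at 20M rows; rows with no match (like the third one above) are simply left untouched by `update`.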