python, dataframe, performance, optimization

Identifying specific items in a df list field to update other fields, at scale


I’ve reached my first real data engineering problem, and I am looking for the optimal way to do this. I have a very large dataframe (15 columns, over 20M rows). Each row contains the following columns:

  • "app_id", which contains a list (actually a string, but I can turn it into a list of strings; see the parsing sketch after this list).
  • "developer", a simple text field.
  • "owner", a simple text field.

I'm looking for the most efficient way to change the 'developer' and 'owner' field values based on whether a certain list item is present in the app_id field. In fact, I have more fields that I'd like to change, but maybe I'm better off merging them. Here is an example of code that does what I intend to do:

import pandas as pd

# Sample dataframe
data = {
    'app_id': [
        ['app_id_1', 'app_id_2'],
        ['app_id_1', 'app_id_2','app_id_5'],
        ['app_id_3', 'app_id_4'],
        ['app_id_3', 'app_id_4'],
        ['app_id_1', 'app_id_2']
    ],
    'developer': ['Developer 1'] * 5,
    'owner': ['Owner 1'] * 5,
}
df = pd.DataFrame(data)
# Make the df 1 million rows for benchmarking
df = pd.concat([df]*200000, ignore_index=True)

# Lists of apps that I want to change the 'developer'/'owner' columns for
apps_dev2 = ['app_id_1', 'app_id_2']
apps_dev3 = ['app_id_3']

# Dictionary of the changes I want to make. For simplicity, this example has
# just two entries with two attributes (developer and owner) each.
dict_apps_changes = {
  'Dev2': {'developer':'Developer 2',
           'apps': apps_dev2,
           'owner': 'Owner 2',
          },
  'Dev3': {'developer':'Developer 3',
           'apps': apps_dev3,
           'owner': 'Owner 3',
          },
}

I tried this:

df_updated = df.copy()
for key, value in dict_apps_changes.items():
    # Rows whose app_id list contains any of this entry's apps
    mask = df_updated['app_id'].apply(lambda x: any(item in x for item in value['apps']))
    df_updated.loc[mask, 'developer'] = value['developer']
    df_updated.loc[mask, 'owner'] = value['owner']
df_updated

This takes more and more time as the dictionary grows, of course, and even longer as more fields are added.

Does anybody know a way to make it even faster? Note: the df is pulled from a BigQuery table, so I'm happy to embed SQL in my Python code if that's the way to go.
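
For reference, this is roughly how I time the loop above on the 1M-row sample (a plain time.perf_counter harness, nothing BigQuery-specific):

import time

start = time.perf_counter()
df_updated = df.copy()
for key, value in dict_apps_changes.items():
    mask = df_updated['app_id'].apply(lambda x: any(item in x for item in value['apps']))
    df_updated.loc[mask, 'developer'] = value['developer']
    df_updated.loc[mask, 'owner'] = value['owner']
print(f"Elapsed: {time.perf_counter() - start:.2f} s")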


Solution

  • I can see you're scanning the whole dataframe once per dictionary entry, which gets quite inefficient as the df (and the dictionary) grows. You can do a couple of things to optimize the approach:

    1. Flatten the dict_apps_changes dictionary so each app_id maps directly to its new values.
    2. Use the DataFrame.update method, a built-in pandas operation that applies all the changes to the dataframe in one go.

    Here's an updated snippet you can try:

    # Flatten dict_apps_changes so each app_id maps straight to its new values
    flattened_dict_app_changes = {}
    for dev_key, dev_info in dict_apps_changes.items():
        for app in dev_info['apps']:
            flattened_dict_app_changes[app] = {
                'developer': dev_info['developer'],
                'owner': dev_info['owner'],
            }

    # Return the replacement values for the first matching app_id,
    # or the row's current values if nothing matches
    def apply_changes(row):
        for app_id in row['app_id']:
            if app_id in flattened_dict_app_changes:
                return flattened_dict_app_changes[app_id]
        return {'developer': row['developer'], 'owner': row['owner']}

    # Build a DataFrame of new values, then merge it back in one go
    changes = df.apply(apply_changes, axis=1).apply(pd.Series)
    df.update(changes)
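
    As a quick sanity check on the sample data, the distinct (developer, owner) pairs after the update should be (Developer 2, Owner 2) and (Developer 3, Owner 3):

    # Distinct developer/owner pairs remaining after the update
    print(df[['developer', 'owner']].drop_duplicates())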