Search code examples
pythonpandasdataframeapply

Efficiently add a value to a new column in a large DataFrame


I have two dataframes, adv_text with about 9,000 rows and events with over 900,000 rows. events is essentially an expanded version of adv_text with about 100 rows per row in adv_text. I want to add three columns from adv_text to events.

The following code is a partial addition of one column.

events_x = events.head(30000).copy()

def add_date(game_id):
    date = adv_text[adv_text['id_odsp'] == game_id]['date']
    return(date.iloc[0])

events_x['date'] = events_x['id_odsp'].apply(add_date)

This test code takes almost 25 seconds for 30,000 rows. At this speed, adding all three columns over the full dataframe will take nearly 40 minutes. Is this typical? Is there a faster way to accomplish this task?


Solution

  • IIUC, one way is to use merge:

    events_x['date'] = events_x.merge(adv_text[['id_odsp', 'date']], on='id_odsp')['date']
    

    More information: Pandas Merging 101