Tags: python, pandas, dataframe, optimization, large-data

Fastest way to join column values in a pandas dataframe?


Problem:

Given a large data set (3 million rows x 6 columns), what is the fastest way to concatenate the values of several columns into a single column of a pandas DataFrame, but only for the rows where a mask is true?

My current solution:

import pandas as pd
import numpy as np
  
# Note: Real data will be 3 million rows x 6 columns.
df = pd.DataFrame({'time': ['0', '1', '2', '3'],
                   'msg': ['msg0', 'msg1', 'msg0', 'msg2'],
                   'd0': ['a', 'x', 'a', '1'],
                   'd1': ['b', 'x', 'b', '2'],
                   'd2': ['c', 'x', np.nan, '3']})
#print(df)

msg_text_filter = ['msg0', 'msg2']
columns = df.columns.drop(df.columns[:3])
column_join = ["d0"]

mask = df['msg'].isin(msg_text_filter)

# Blank out NaN so the string concatenation below works
df.replace(np.nan, '', inplace=True)
# THIS IS SLOW, HOW TO SPEED UP?
df['d0'] = np.where(
    mask,
    df[['d0','d1','d2']].agg(''.join, axis=1),
    df['d0']
)
df.loc[mask, columns] = np.nan

print(df)
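
For reference, a rough way to reproduce the target scale and time the slow step, reusing the imports above (the 750_000 repetition factor and the reuse of the sample rows are only illustrative assumptions, not the real data):

import time

# Rebuild the sample frame and repeat it to roughly 3 million rows
# (750_000 copies of 4 rows; the factor is purely illustrative).
sample = pd.DataFrame({'time': ['0', '1', '2', '3'],
                       'msg': ['msg0', 'msg1', 'msg0', 'msg2'],
                       'd0': ['a', 'x', 'a', '1'],
                       'd1': ['b', 'x', 'b', '2'],
                       'd2': ['c', 'x', np.nan, '3']})
big = pd.concat([sample] * 750_000, ignore_index=True).replace(np.nan, '')

start = time.perf_counter()
_ = big[['d0', 'd1', 'd2']].agg(''.join, axis=1)   # the slow row-wise join
print(f"agg(''.join, axis=1): {time.perf_counter() - start:.2f}s")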


Solution

  • IMHO you can save a lot of time by using

    df[['d0', 'd1', 'd2']].sum(axis=1)
    

    instead of

    df[['d0', 'd1', 'd2']].agg(''.join, axis=1)
    

    And I think instead of using np.where you could just do:

    df.loc[mask, 'd0'] = df.loc[mask, ['d0', 'd1', 'd2']].sum(axis=1)
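
    Putting both suggestions together on the sample frame from the question, a minimal sketch might look like this (the replace of NaN with '' is kept so string concatenation works; exact timings will depend on the real data):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'time': ['0', '1', '2', '3'],
                       'msg': ['msg0', 'msg1', 'msg0', 'msg2'],
                       'd0': ['a', 'x', 'a', '1'],
                       'd1': ['b', 'x', 'b', '2'],
                       'd2': ['c', 'x', np.nan, '3']})

    msg_text_filter = ['msg0', 'msg2']
    mask = df['msg'].isin(msg_text_filter)

    df = df.replace(np.nan, '')
    # Concatenate d0..d2 only on the masked rows; sum(axis=1) on string
    # columns adds (concatenates) them without a per-row Python ''.join call.
    df.loc[mask, 'd0'] = df.loc[mask, ['d0', 'd1', 'd2']].sum(axis=1)
    df.loc[mask, ['d1', 'd2']] = np.nan
    print(df)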