Search code examples
pythonpandasjoinpandas-loc

Most efficient way of joining dataframes in pandas: loc or join?


Suppose I have two dataframes; one holds transactions, trans and the other holds product information, prod, and I want to join the product prices, the variable price, on to the transaction data frame, repeating them down for each column. Which of these approaches is more efficient / preferred:

Method 1:

trans = trans.set_index('product_id').join(trans.set_index('product_id'))

Method 2:

trans.set_index('product_id',inplace=True)
trans['price'] = prod.loc[trans.product_id, 'price']

Solution

  • It seems you need map:

    trans = pd.DataFrame({'product_id':[1,2,3],
                       'price':[4,5,6]})
    
    print (trans)
       price  product_id
    0      4           1
    1      5           2
    2      6           3
    
    prod = pd.DataFrame({'product_id':[1,2,4],
                       'price':[40,50,60]})
    
    print (prod)
       price  product_id
    0     40           1
    1     50           2
    2     60           4
    
    d = prod.set_index('product_id')['price'].to_dict()
    trans['price'] = trans['product_id'].map(d)
    print (trans)
       price  product_id
    0   40.0           1
    1   50.0           2
    2    NaN           3