Search code examples
pythonpandasdataframemerge

Python: Merge two dataframes on certain columns using the excel equivalent of index/match


I'm trying to replicate an excel type of "index/match" in python. I have two dfs.

df1:

RIC Sector Code SP500 SP400 SP600
FI 40 SP500
BRBR 30 SP400
XPEL 25 SP600

df2: (contains lookup table)

Code Name SP400 SP500 SP600
25 Consumer Discretionary .SPMDCD .SPSMCD .SPLRCD
30 Consumer Staples .SPMDCS .SPSMCS .SPLRCS
10 Energy .SPMDCE .SPSMCE .SPNY
40 Financials .SPMDCF .SPSMCF .SPSY
35 Health Care .SPMDCA .SPSMCA .SPXHC
20 Industrials .SPMDCI .SPSMCI .SPLRCI
45 Information Technology .SPMDCT .SPSMCT .SPLRCT
15 Materials .SPMDCM .SPSMCM .SPLRCM
50 Telecommunication Services .SPMDCL .SPSMCL .SPLRCL
55 Utilities .SPMDCU .SPSMCU .SPLRCU

Desired Result: (Adds the result to df1)

RIC Sector Code SP500 SP400 SP600 RESULT SP500 RESULT SP400 RESULT SP600
FI 40 SP500 .SPSMCF
BRBR 30 SP400 .SPMDCS
XPEL 25 SP600 .SPLRCD

I tried merged_df = df1.merge(df2, left_on=['Sector Code'], right_on=['Code'], how='left').

This combined df1 and df2, but it give me every match (SP500, SP400, and SP600) instead of just the one match for each.


Solution

  • This Code is Most Efficient for huge Datasets.

    import pandas as pd
    
    df1 = pd.DataFrame([
     {'RIC': 'FI', 'Sector Code': 40, 'SP500': 'SP500', 'SP400': '', 'SP600': ''},
     {'RIC': 'BRBR', 'Sector Code': 30, 'SP400': 'SP400', 'SP500': '', 'SP600': ''},
     {'RIC': 'XPEL', 'Sector Code': 25, 'SP400': '', 'SP500': '', 'SP600': 'SP600'}])
    
    df2 = pd.DataFrame({
     'Code': [25, 30, 10, 40, 35, 20, 45, 15, 50, 55],
     'Name': ['Consumer Discretionary', 'Consumer Staples', 'Energy', 'Financials', 
               'Health Care', 'Industrials', 'Information Technology', 'Materials', 
               'Telecommunication Services', 'Utilities'],
     'SP400': ['.SPMDCD', '.SPMDCS', '.SPMDCE', '.SPMDCF', '.SPMDCA', '.SPMDCI', 
               '.SPMDCT', '.SPMDCM', '.SPMDCL', '.SPMDCU'],
     'SP500': ['.SPSMCD', '.SPSMCS', '.SPSMCE', '.SPSMCF', '.SPSMCA', '.SPSMCI', 
               '.SPSMCT', '.SPSMCM', '.SPSMCL', '.SPSMCU'],
     'SP600': ['.SPLRCD', '.SPLRCS', '.SPNY', '.SPSY', '.SPXHC', '.SPLRCI', 
               '.SPLRCT', '.SPLRCM', '.SPLRCL', '.SPLRCU']})
    
    """
    print(df1)
        RIC  Sector Code  SP500  SP400  SP600
    0    FI           40  SP500              
    1  BRBR           30         SP400       
    2  XPEL           25                SP600
    
    print(df2)
    
       Code                        Name    SP400    SP500    SP600
    0    25      Consumer Discretionary  .SPMDCD  .SPSMCD  .SPLRCD
    1    30            Consumer Staples  .SPMDCS  .SPSMCS  .SPLRCS
    2    10                      Energy  .SPMDCE  .SPSMCE    .SPNY
    3    40                  Financials  .SPMDCF  .SPSMCF    .SPSY
    4    35                 Health Care  .SPMDCA  .SPSMCA   .SPXHC
    5    20                 Industrials  .SPMDCI  .SPSMCI  .SPLRCI
    6    45      Information Technology  .SPMDCT  .SPSMCT  .SPLRCT
    7    15                   Materials  .SPMDCM  .SPSMCM  .SPLRCM
    8    50  Telecommunication Services  .SPMDCL  .SPSMCL  .SPLRCL
    9    55                   Utilities  .SPMDCU  .SPSMCU  .SPLRCU
    """
    sp_cols = [col for col in df2.filter(like='SP')]
    res_dict ={}
    for col in sp_cols :
        res_dict[f'RESULT_{col}'] = df1['Sector Code'].apply(
            lambda d : df2.set_index('Code').loc[d,col] if d in df2['Code'].tolist() else np.nan
            
            )
        #print(res_dict[f'RESULT_{col}'])
    df1_join_res_dict = df1.join(pd.DataFrame(res_dict))
    print(df1_join_res_dict.to_string())
    """
       RIC  Sector Code  SP500  SP400  SP600 RESULT_SP400 RESULT_SP500 RESULT_SP600
    0    FI           40  SP500                    .SPMDCF      .SPSMCF        .SPSY
    1  BRBR           30         SP400             .SPMDCS      .SPSMCS      .SPLRCS
    2  XPEL           25                SP600      .SPMDCD      .SPSMCD      .SPLRCD
    
    """