Search code examples
pythonpandaslookup

How to get data from another dataframe based on key values


I have two dataframes

df1

Area-ID-Series Area-ID MODE
BSD-1-1800 BSD-1 VM
BSD-2-1800 BSD-2 VM
BSD-3-1800 BSD-3 VM
BSE-1-2100 BSE-1 XM
BSE-1-900 BSE-1 VM
BSE-2-2100 BSE-2 XM
BSE-2-900 BSE-2 VM
BSE-3-2100 BSE-3 XM
BSE-3-900 BSE-3 VM
SMR-1-1800 SMR-1 VM

df2

Area-ID 900 1800 2100
BSD-1
BSD-2
BSD-3
BSE-1
BSE-2
BSE-3
SMR-1

df2 has a key based on df1 but without the "Series", because the "Series" transformed into columns.

I want to get MODE value for each series based on Area-ID, so my desired output is like this

df2

Area-ID 900 1800 2100
BSD-1 - VM -
BSD-2 - VM -
BSD-3 - VM -
BSE-1 VM - XM
BSE-2 VM - XM
BSE-3 VM - XM
SMR-1 - VM -

I have tried this code but still no idea how it should be done

df_result = pd.merge(df2,
                    df1[['Area-ID', 'MODE']],
                    on= 'Area-ID',
                    how='left')

I have tried to do this in excel with df1 as "city" the formula look like this

900 series column

=IFNA(VLOOKUP(A2&"-"&$B$1,city!A:C,3,0),"-") 

1800 series column

=IFNA(VLOOKUP(A2&"-"&$C$1,city!A:C,3,0),"-") 

2100 series column

=IFNA(VLOOKUP(A2&"-"&$D$1,city!A:C,3,0),"-") 

it worked, but it takes a lot of time because my data is huge so I try to do it with python instead


Solution

  • I don't know merge very well. Perhaps there is a way to do based on it. Did it differently.

    A 'number' column is created. To do this, the rows of the Area-ID-Series column of dataframe df1 are split() into an array by delimiter using split and expand=True is applied to get the columns. The second column is retrieved.

    Further, the dataframe is grouped by df1.index, that is, it is called on each row, where the expression df2['Area-ID'] == x['Area-ID'].values[0] is a mask for indexing rows, and x[' number'].values[0] acts as a column name for df2 filtering. Explicit loc indexing is used, where the row indexes are on the left, the column name is on the right. .values[0] is used to extract the value from the list.

    import pandas as pd
    
    df1 = pd.read_csv('df1.csv', header=0)
    df2 = pd.read_csv('df2.csv', header=0)
    
    df1['number'] = df1['Area-ID-Series'].str.split('-', expand=True)[2].astype(str)
    
    
    def my_func(x):
        df2.loc[df2['Area-ID'] == x['Area-ID'].values[0], x['number'].values[0]] = x['MODE'].values[0]
    
    
    df1.groupby(df1.index).apply(my_func)
    
    print(df2)
    

    Output

      Area-ID  900 1800 2100
    0   BSD-1  NaN   VM  NaN
    1   BSD-2  NaN   VM  NaN
    2   BSD-3  NaN   VM  NaN
    3   BSE-1   VM  NaN   XM
    4   BSE-2   VM  NaN   XM
    5   BSE-3   VM  NaN   XM
    6   SMR-1  NaN   VM  NaN