Search code examples
pythonpandaslookup

lookup missing values from another dataframe in pandas


I have two dataframes.

df1

col1 var1 var2 var3

X11            NA (for var3)

X12 NA (for var2)

X13  NA (for var1)

df1 has a few columns (float64 type representing some categories) like var1, var2, var3 with values between 1-5 for each and some missing values for the categories.

I want to fill in the missing values (in each of var1, var2, and var3 columns) using another dataframe, df2 such that df2 has a column with the value for the category.

df2

col1 col2   val col4

X11  var1   3   X11-X21

X12  var3   2   X21-X22

X13  var2   1   X13-X32

col4 is the concatenation of col1 and col2 but it did not help much.

How could I do this? Since we need to look up on several columns and also because of the structure of df1, I found it complicated to use pivot or melt or even one-hot encoding (produces 5 columns each with _1 to _5 suffixed. I also though about creating a set but then the pairs must be unique which is not the case. Same when I thought of using dictionary as I cannot think of unique keys.

How could I solve this issue?

Thanks.


Solution

  • The example below works with the small sample you provided.

    The code goes through the rows of df1, reading out each row into the variable row. It then goes through the values (and column names) of row. When a value is pd.NA, it indexes into df2 based on the index of row, and returns the df2["val"] at that index.

    enter image description here

    import pandas as pd
    
    #
    # Create test data
    #
    df1 = pd.DataFrame({
        'col1': ['X11', 'X12', 'X13'],
        'var1': [100, 200, pd.NA],
        'var2': [300, pd.NA, 400],
        'var3': [pd.NA, 500, 600]
    })
    
    df2 = pd.DataFrame({
        'col1': ['X11', 'X12', 'X13'],
        'col2': ['var3', 'var2', 'var1'],
        'val': [3, 2, 1],
    })
    
    
    #
    # Solution
    #
    def fill_missing(row):
        idx = row.name #index of current row
        
        #Go over each entry in this row
        for col, value in row.items():
            if not pd.isna(value):
                continue
            #If NA, cross reference against df2
            # and overwrite the row's value
            col_to_fill = df2.loc[idx, 'col2']
            assert col_to_fill == col #should match if df2 is right about the NA locations
            val_to_fill = df2.loc[idx, 'val']
            row[col] = val_to_fill
        return row
    
    # Apply the fill_missing function to each row in df1 using .apply(..., axis=1)
    df1_filled = df1.copy().apply(fill_missing, axis=1)