I have two dataframes.
df1
col1 var1 var2 var3
X11 NA (for var3)
X12 NA (for var2)
X13 NA (for var1)
df1 has a few columns (float64 type representing some categories) like var1, var2, var3 with values between 1-5 for each and some missing values for the categories.
I want to fill in the missing values (in each of var1, var2, and var3 columns) using another dataframe, df2 such that df2 has a column with the value for the category.
df2
col1 col2 val col4
X11 var1 3 X11-X21
X12 var3 2 X21-X22
X13 var2 1 X13-X32
col4 is the concatenation of col1 and col2 but it did not help much.
How could I do this? Since we need to look up on several columns and also because of the structure of df1, I found it complicated to use pivot or melt or even one-hot encoding (produces 5 columns each with _1 to _5 suffixed. I also though about creating a set but then the pairs must be unique which is not the case. Same when I thought of using dictionary as I cannot think of unique keys.
How could I solve this issue?
Thanks.
The example below works with the small sample you provided.
The code goes through the rows of df1
, reading out each row into the variable row
. It then goes through the values (and column names) of row
. When a value is pd.NA
, it indexes into df2
based on the index of row
, and returns the df2["val"]
at that index.
import pandas as pd
#
# Create test data
#
df1 = pd.DataFrame({
'col1': ['X11', 'X12', 'X13'],
'var1': [100, 200, pd.NA],
'var2': [300, pd.NA, 400],
'var3': [pd.NA, 500, 600]
})
df2 = pd.DataFrame({
'col1': ['X11', 'X12', 'X13'],
'col2': ['var3', 'var2', 'var1'],
'val': [3, 2, 1],
})
#
# Solution
#
def fill_missing(row):
idx = row.name #index of current row
#Go over each entry in this row
for col, value in row.items():
if not pd.isna(value):
continue
#If NA, cross reference against df2
# and overwrite the row's value
col_to_fill = df2.loc[idx, 'col2']
assert col_to_fill == col #should match if df2 is right about the NA locations
val_to_fill = df2.loc[idx, 'val']
row[col] = val_to_fill
return row
# Apply the fill_missing function to each row in df1 using .apply(..., axis=1)
df1_filled = df1.copy().apply(fill_missing, axis=1)