I have two dataframes, train and test. The test set has missing values in one column.
import numpy as np
import pandas as pd
train = [[0,1],[0,2],[0,3],[0,7],[0,7],[1,3],[1,5],[1,2],[1,2]]
test = [[0,0],[0,np.nan],[1,0],[1,np.nan]]
train = pd.DataFrame(train, columns = ['A','B'])
test = pd.DataFrame(test, columns = ['A','B'])
The test set has two missing values in column B. Suppose the groupby column is A.
If the strategy is mode, then the missing values should be imputed with 7 and 2.
If the strategy is mean, then the missing values should be (1+2+3+7+7)/5 = 4 and (3+5+2+2)/4 = 3.
What is a good way to do this?
This question is related, but uses only one dataframe instead of two.
IIUC, here's one way:
from statistics import mode
test_mode = test.set_index('A').fillna(train.groupby('A').agg(mode)).reset_index()
test_mean = test.set_index('A').fillna(train.groupby('A').mean()).reset_index()
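To see why this works, it can help to look at the lookup table that groupby builds from train before it is handed to fillna. A small sketch, assuming the same train as in the question; lookup_mean and lookup_mode are just names I'm using here for illustration:

```python
from statistics import mode

import pandas as pd

train = pd.DataFrame(
    [[0, 1], [0, 2], [0, 3], [0, 7], [0, 7], [1, 3], [1, 5], [1, 2], [1, 2]],
    columns=["A", "B"],
)

# One row per value of A, indexed by A; column B holds the group statistic.
lookup_mean = train.groupby("A").mean()
lookup_mode = train.groupby("A").agg(mode)

print(lookup_mean)
print(lookup_mode)
```

Because test.set_index('A') is also indexed by A, fillna lines the two frames up on that index and on the column name B, so each NaN picks up the value for its own group.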
If you want a function:
from statistics import mode
def evaluate_nan(strategy='mean'):
    # Fill NaNs in test with the per-group statistic computed on train
    return test.set_index('A').fillna(train.groupby('A').agg(strategy)).reset_index()
test_mean = evaluate_nan()
test_mode = evaluate_nan(strategy = mode)
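If you'd rather not rely on the global train and test, or on the set_index round trip, one possible variant is to pass both frames in and map each row's group key to its fill value. This is only a sketch under the assumptions of the question (one group column, one value column); impute_from_train and first_mode are hypothetical helper names, not pandas functions:

```python
import numpy as np
import pandas as pd

train = pd.DataFrame(
    [[0, 1], [0, 2], [0, 3], [0, 7], [0, 7], [1, 3], [1, 5], [1, 2], [1, 2]],
    columns=["A", "B"],
)
test = pd.DataFrame([[0, 0], [0, np.nan], [1, 0], [1, np.nan]], columns=["A", "B"])


def impute_from_train(train, test, group_col="A", value_col="B", strategy="mean"):
    """Fill NaNs in test[value_col] with a per-group statistic computed on train."""
    # One fill value per group, computed only from the training data.
    fill_values = train.groupby(group_col)[value_col].agg(strategy)
    out = test.copy()  # leave the caller's frame untouched
    # Look up each row's group key in fill_values, then fill only the NaNs.
    out[value_col] = out[value_col].fillna(out[group_col].map(fill_values))
    return out


def first_mode(s):
    # Series.mode() returns every mode in sorted order; keep the first one.
    return s.mode().iloc[0]


test_mean = impute_from_train(train, test, strategy="mean")
test_mode = impute_from_train(train, test, strategy=first_mode)
```

With the data from the question, column B of test_mean comes out as 0, 4, 0, 3 and column B of test_mode as 0, 7, 0, 2. Note that first_mode breaks ties by taking the smallest mode, which may differ from what statistics.mode does when a group has more than one mode.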