python pandas missing-data

Filling missing values of test from groupby mean of training set


I have two dataframes, train and test. The test set has missing values in one column.

import numpy as np
import pandas as pd

train = [[0,1],[0,2],[0,3],[0,7],[0,7],[1,3],[1,5],[1,2],[1,2]]
test = [[0,0],[0,np.nan],[1,0],[1,np.nan]]

train = pd.DataFrame(train, columns = ['A','B'])
test = pd.DataFrame(test, columns = ['A','B'])

The test set has two missing values in column B. Grouping by column A, I want to fill them with a statistic computed from the training set:

  • If the imputing strategy is mode, then the missing values should be imputed with 7 and 2.
  • If the imputing strategy is mean, then the missing values should be (1+2+3+7+7)/5 = 4 and (3+5+2+2)/4 = 3.
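These target values can be checked against the training set. The following is only a minimal sketch: it rebuilds train so that it runs on its own, and the names group_mean and group_mode are chosen here for illustration.

```python
import pandas as pd

train = pd.DataFrame(
    [[0, 1], [0, 2], [0, 3], [0, 7], [0, 7], [1, 3], [1, 5], [1, 2], [1, 2]],
    columns=["A", "B"],
)

# Mean of B within each group of A.
group_mean = train.groupby("A")["B"].mean()

# Most frequent value of B within each group of A.
# Series.mode returns all tied values in sorted order, so iloc[0] keeps the smallest.
group_mode = train.groupby("A")["B"].agg(lambda s: s.mode().iloc[0])

print(group_mean)
print(group_mode)
```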

What is a good way to do this?

This question is related, but uses only one dataframe instead of two.


Solution

  • IIUC, here's one way:

    from statistics import mode
    
    # Index test by A so that fillna aligns each row with its group's statistic from train
    test_mode = test.set_index('A').fillna(train.groupby('A').agg(mode)).reset_index()
    test_mean = test.set_index('A').fillna(train.groupby('A').mean()).reset_index()
    

    If you want a function:

    from statistics import mode
    
    def evaluate_nan(strategy='mean'):
        # strategy can be anything groupby.agg accepts: a name such as 'mean' or a callable such as mode
        return test.set_index('A').fillna(train.groupby('A').agg(strategy)).reset_index()
    
    test_mean = evaluate_nan()
    test_mode = evaluate_nan(strategy=mode)
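If you would rather not rely on the global train and test or on the set_index/reset_index round trip, a variant is to map each test row's group key to the statistic from train and fill from that. This is a sketch of an alternative, not the approach above: impute_from_train and first_mode are hypothetical helper names, not pandas functions.

```python
import numpy as np
import pandas as pd


def impute_from_train(train, test, key, col, strategy="mean"):
    """Fill NaNs in test[col] with a per-group statistic of train[col].

    Hypothetical helper, not part of pandas. `strategy` is passed straight
    to groupby(...).agg, so it can be a name such as "mean" or a callable.
    """
    # One statistic per group, indexed by the group key.
    stats = train.groupby(key)[col].agg(strategy)
    # Look up each test row's group statistic; keys unseen in train give NaN.
    fill_values = test[key].map(stats)
    out = test.copy()
    out[col] = out[col].fillna(fill_values)
    return out


def first_mode(s):
    # Series.mode returns every tied value in sorted order; keep the smallest.
    return s.mode().iloc[0]


train = pd.DataFrame(
    [[0, 1], [0, 2], [0, 3], [0, 7], [0, 7], [1, 3], [1, 5], [1, 2], [1, 2]],
    columns=["A", "B"],
)
test = pd.DataFrame([[0, 0], [0, np.nan], [1, 0], [1, np.nan]], columns=["A", "B"])

test_mean = impute_from_train(train, test, "A", "B")
test_mode = impute_from_train(train, test, "A", "B", strategy=first_mode)
```

The original test is left unmodified, and groups that appear in test but not in train stay NaN. Note that statistics.mode raises StatisticsError on ties before Python 3.8 and returns the first mode encountered from 3.8 on, whereas first_mode always picks the smallest tied value.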