Search code examples
pythonscikit-learnknn

Python, how to use KNNImputer from sklearn and impute data using groupby (filling missing values)


I have big data and I need to do imputation to fill out missing values. but I need to do imputation using groupby to fill missing values based on another column's value. let's say I have this table

data = {'name':  ['Alex', 'Ben', 'Marry','Alex', 'Ben', 'Marry'],
        'job': ['teacher', 'doctor', 'engineer','teacher', 'doctor', 'engineer'],
        'age': [27, 32, 78,27, 32, 78],
        'weight': [160, 209, 130,164, 206, 132],
        'date': ['6-12-2022', '6-12-2022', '6-12-2022','6-13-2022', '6-13-2022', '6-13-2022']
        }

df = pd.DataFrame(data) df

I add data for the next day, but as null values, and I end up with this table:

    |name   |job        |age|weight |date
|---|-------|-----------|---|-------|--------
|0  |Alex   |teacher    |27 |160    |6-12-2022
|1  |Ben    |doctor     |32 |209    |6-12-2022
|2  |Marry  |engineer   |78 |130    |6-12-2022
|3  |Alex   |teacher    |27 |164    |6-13-2022
|4  |Ben    |doctor     |32 |206    |6-13-2022
|5  |Marry  |engineer   |78 |132    |6-13-2022
|6  |Alex   |teacher    |NaN|NaN    |6-14-2022
|7  |Ben    |doctor     |NaN|NaN    |6-14-2022
|8  |Marry  |engineer   |NaN|NaN    |6-14-2022

now if I want to fill out values for "age" and "weight" columns using KNNImputer or any other imputer, how can I do that while I groupby "name"? As I mentioned the dataframe is a sample and I am dealing with big data Appreciate any helps

Thanks


Solution

  • @mnm. The problem you posted is not straightforward to solve. The main starting problem is the fact that all numerical values are NaN and all remaining values are strings. The latter is rather inconvenient for KNN because there is no intuitive distance between strings.

    I will (1) use a one-hot encoding for the job variable, and (2) label the consecutive days by 1, 2, and 3. Overall, this results in the following training and test data:

    TRAINING
        name  teacher  doctor  engineer  age  weight  date
    0   Alex        1       0         0   27     160     1
    1    Ben        0       1         0   32     209     1
    2  Marry        0       0         1   78     130     1
    3   Alex        1       0         0   27     164     2
    4    Ben        0       1         0   32     206     2
    5  Marry        0       0         1   78     132     2
    
    
    
    TEST
        name  teacher  doctor  engineer  age  weight  date
    0   Alex        1       0         0  NaN     NaN     3
    1    Ben        0       1         0  NaN     NaN     3
    2  Marry        0       0         1  NaN     NaN     3
    

    Clearly, this is not the only way to recode the job and date variable into numerical values. For example, you can easily ensure that consecutive days are only 1/365 apart. This essentially gives less weight to date differences when running the nearest neighbors algorithm.

    The code below applies KNN to insert a single missing value into the table. To impute all missing observations:

    1. Transform the code underneath "NEAREST NEIGHBORS" into a function.
    2. Loop through the data the test data, identify a NaN, and impute using this function.

    I hope this helps...

    import numpy as np
    import pandas as pd
    from sklearn.neighbors import KNeighborsRegressor
    
    ############
    ### DATA ###
    ############
    TrainingData = { 'name':  ['Alex', 'Ben', 'Marry','Alex', 'Ben', 'Marry'],
                    'teacher': [1,0,0,1,0,0],
                    'doctor': [0,1,0,0,1,0],
                    'engineer': [0,0,1,0,0,1],
                    'age': [27, 32, 78,27, 32, 78],
                    'weight': [160, 209, 130,164, 206, 132],
                    'date': [1,1,1,2,2,2]}
    
    TestData = {'name': ['Alex', 'Ben', 'Marry'],
                'teacher': [1,0,0],
                'doctor': [0,1,0],
                'engineer': [0,0,1],
                'age': [np.NaN,np.NaN,np.NaN],
                'weight': [np.NaN,np.NaN,np.NaN],
                'data': [3,3,3]}
    # Convert to pandas dataframe
    dfTraining = pd.DataFrame(TrainingData)
    dfTest = pd.DataFrame(TestData)
    # Print
    print(dfTraining)
    print(dfTest)
    
    
    #########################
    ### NEAREST NEIGHBORS ###
    #########################
    Variable2Predict = 'weight'
    Person2Predict = 'Marry'
    
    #### Step 1: Determine training data
    # Select rows of beloning to Person2Predict
    dfSubSelectedByName = dfTraining.loc[dfTraining['name'] == Person2Predict]
    # Assign dependent variable as the variable to predict
    Y = dfSubSelectedByName[Variable2Predict].to_numpy()
    # Nearest neighbor computation should be based on all columns in the data except 'name' and the variable to predict
    X = dfSubSelectedByName[dfSubSelectedByName.columns[~dfSubSelectedByName.columns.isin(['name', Variable2Predict])]].to_numpy()
    print(X)
    
    ### Step 2: Nearest neighbor prediction
    K = 1
    # Read row from pandas dataframe based on name
    Row2Predict = dfTest.loc[dfTest['name'] == Person2Predict]
    # Select all variable except for name and dependent variable
    Xpred = Row2Predict[Row2Predict.columns[~Row2Predict.columns.isin(['name', Variable2Predict])]].to_numpy()
    
    # Step 3: K-NN based on the observed variables
    PresentVariables = ~np.isnan(Xpred)
    NumberOfPresentVariables = np.sum(PresentVariables)
    NearestNeighbor = KNeighborsRegressor(n_neighbors=K).fit(X[:, PresentVariables[0]], Y)
    Yhat = NearestNeighbor.predict( np.reshape( Xpred[PresentVariables], (1,NumberOfPresentVariables) ) )
    print(Yhat)