python anaconda regression sklearn-pandas

How to predict NaNs in Python using linear regression


I have a dataset that is missing some Y values, which I would like to predict. To fit a model first, I dropped the rows where Y is NaN, using this code:

import statsmodels.api as sm 
from sklearn import linear_model

RBall.dropna(subset=['NextHPPR'], inplace = True)

X = RBall[['ReceivingTargets_x','SnapsPlayedPercentage','RushingAttempts_x', 'RushingAttempts_y']]

Y = RBall['NextHPPR']

lm = linear_model.LinearRegression()
model = lm.fit(X,Y)

Here is a screenshot of my data before removing the NaNs. Note the NaNs in NextHPPR, my Y variable in the regression.

Now I would like to use my model to go back and predict the missing NaNs. I understand it's an elementary question, but this is my first day using Python. Thank you.


Solution

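  • I would use NumPy to find the indices of the NaNs in Y, fit on the rows where Y is known, and then call predict on the rows where it is missing.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Toy data: a single feature column and a target with two missing values.
    # scikit-learn expects X to be 2-D, hence the reshape.
    X = np.array([432, 234442, 43, 423, 2342, 3434]).reshape(-1, 1)
    Y = np.array([342, np.nan, 23, 545, np.nan, 23])

    # Indices of the rows where Y is missing
    nan_idx = np.argwhere(np.isnan(Y)).flatten()

    print(X[nan_idx].ravel())
    # [234442   2342]

    # Fit on the rows with a known Y, then predict the missing ones
    known = ~np.isnan(Y)
    lm = LinearRegression().fit(X[known], Y[known])
    predict_NaNs = lm.predict(X[nan_idx])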
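
The same idea can be applied directly to a DataFrame like the one in the question. The following is only a sketch: the `RBall` frame here is a made-up stand-in with invented values, `features` and `missing` are names introduced for illustration, and it assumes the feature columns themselves contain no NaNs. Rather than dropping rows with `inplace=True` (which discards the very rows you want to predict), it keeps the full frame and uses a boolean mask.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in for RBall; all values are invented for illustration.
RBall = pd.DataFrame({
    "ReceivingTargets_x": [5, 3, 8, 2, 6, 4, 7, 1],
    "SnapsPlayedPercentage": [0.60, 0.45, 0.80, 0.30, 0.70, 0.50, 0.75, 0.20],
    "RushingAttempts_x": [12, 8, 15, 5, 14, 9, 16, 3],
    "RushingAttempts_y": [10, 7, 16, 4, 13, 8, 12, 6],
    "NextHPPR": [14.2, np.nan, 21.5, 6.1, np.nan, 11.0, 18.3, 4.0],
})

features = ["ReceivingTargets_x", "SnapsPlayedPercentage",
            "RushingAttempts_x", "RushingAttempts_y"]

# Boolean mask marking the rows whose target is missing
missing = RBall["NextHPPR"].isna()

# Fit only on rows with a known target; the original frame stays intact
lm = LinearRegression()
lm.fit(RBall.loc[~missing, features], RBall.loc[~missing, "NextHPPR"])

# Predict the missing targets and write them back into the same column
RBall.loc[missing, "NextHPPR"] = lm.predict(RBall.loc[missing, features])

print(RBall["NextHPPR"].isna().sum())  # 0
```

If you would rather keep the predictions separate from the observed values, assigning them to a new column instead of overwriting `NextHPPR` is a reasonable alternative.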