Tags: python, dataframe, machine-learning, data-cleaning, sklearn-pandas

How to fill missing values using a pre-trained model?


I have a dataframe with a time series index, a few variables (X, Y and Z) and a Humidity reading. I have already trained an ML model to predict Humidity from X, Y and Z. Now, after loading the saved model with pickle, I would like to fill the missing Humidity values using X, Y and Z. However, a row should only be filled when X, Y and Z are themselves not missing.

Time                    X        Y        Z       Humidity
1/2/2017 13:00          31       22       21           48
1/2/2017 14:00          NaN      12       NaN          NaN
1/2/2017 15:00          25       55       33           NaN

In this example, the Humidity in the last row should be filled using the model, whereas the 2nd row should not be predicted because X and Z are also missing.

I have tried this so far:

with open('model_pickle','rb') as f:
    mp = pickle.load(f)

for i, value in enumerate(df['Humidity'].values):
    if np.isnan(value):
        df['Humidity'][i] = mp.predict(df['X'][i],df['Y'][i],df['Z'][i])

This gave me the error 'predict() takes from 2 to 5 positional arguments but 6 were given', and it also does not check whether the X, Y and Z values are missing. Below is the code I used to train the model and save it to a file:

import pickle

import xgboost as xgb
from sklearn.metrics import mean_absolute_error

# Train only on complete rows.
df = df.dropna()

dfTest = df.loc['2017-01-01':'2019-02-28']
dfTrain = df.loc['2019-03-01':'2019-03-18']
features = ['X', 'Y', 'Z']
train_X = dfTrain[features]
train_y = dfTrain.Humidity
test_X = dfTest[features]
test_y = dfTest.Humidity

model = xgb.XGBRegressor(max_depth=10, learning_rate=0.07)
model.fit(train_X, train_y)
predXGB = model.predict(test_X)
mae = mean_absolute_error(test_y, predXGB)

# Save the trained model to disk.
with open('model_pickle', 'wb') as f:
    pickle.dump(model, f)

I had no errors during training and saving the model.


Solution

  • For prediction, you need rows where X, Y and Z are all present, so first drop the rows with a missing feature:

    df = df.dropna(subset = ["X", "Y", "Z"])
    

    You can then predict Humidity for the remaining rows in a single call:

    # where features = ["X", "Y", "Z"]
    df['Humidity'] = mp.predict(df[features]) 
    

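    mp.predict takes the whole table of features (one row per sample) as its first argument and returns one prediction per row, so there is no need to loop over the rows. Note that this assignment overwrites every Humidity value, including the readings that were already present.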

    Edit:

    To fill only the missing Humidity values in a dataframe df while keeping the existing readings, you can do:

    # Assumes: import pandas as pd; features = ["X", "Y", "Z"]

    # Rows with missing Humidity are the candidates for prediction.
    df_inference = df[df.Humidity.isnull()]

    # The remaining rows already have a Humidity reading.
    df = df[df.Humidity.notnull()]

    # Candidates with any missing feature cannot be predicted,
    # so move them back to the remaining rows.
    # (pd.concat replaces DataFrame.append, which was removed in pandas 2.0.)
    missing_features = df_inference[features].isnull().any(axis=1)
    df = pd.concat([df, df_inference[missing_features]])

    # Keep only the candidates with complete features
    # (copy to avoid a SettingWithCopyWarning on assignment).
    df_inference = df_inference[~missing_features].copy()

    # Now you can infer on these rows.
    df_inference['Humidity'] = mp.predict(df_inference[features])

    # Merge back to restore the original number of rows and sort by index.
    # sort_index() returns a new dataframe, so assign the result.
    df = pd.concat([df, df_inference])
    df = df.sort_index()
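
The same steps can also be written with a single boolean mask, which avoids splitting and re-joining the dataframe. The following is a minimal sketch rather than a definitive implementation: `StandInModel` and `fill_humidity` are hypothetical names introduced for illustration, and the stand-in's prediction rule (the mean of the features) is only a placeholder so that the example runs without the pickled model.

```python
import numpy as np
import pandas as pd


class StandInModel:
    """Hypothetical stand-in for the unpickled model `mp` (assumption)."""

    def predict(self, X):
        # Placeholder rule, not a real humidity model: mean of the features.
        return np.asarray(X, dtype=float).mean(axis=1)


def fill_humidity(df, model, features=("X", "Y", "Z"), target="Humidity"):
    """Return a copy of df where target is filled only in rows that
    lack the target but have every feature present."""
    features = list(features)
    out = df.copy()
    # True where the target is missing AND no feature is missing.
    mask = out[target].isna() & out[features].notna().all(axis=1)
    if mask.any():
        out.loc[mask, target] = model.predict(out.loc[mask, features])
    return out


# The sample data from the question (date order assumed month/day/year).
df = pd.DataFrame(
    {
        "X": [31, np.nan, 25],
        "Y": [22, 12, 55],
        "Z": [21, np.nan, 33],
        "Humidity": [48, np.nan, np.nan],
    },
    index=pd.to_datetime(
        ["2017-01-02 13:00", "2017-01-02 14:00", "2017-01-02 15:00"]
    ),
)

filled = fill_humidity(df, StandInModel())
print(filled)
```

With the real model, pass the unpickled `mp` in place of `StandInModel()`. The first row keeps its reading, the second row stays NaN because X and Z are missing, and only the third row is predicted. Row order and the index are left unchanged, so no sorting is needed afterwards.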