Tags: python, machine-learning, scikit-learn, prediction

How to make machine learning predictions for empty rows?


I have a dataset that shows whether a person has diabetes based on several indicators. It looks like this (original dataset):

[Image: original dataset]

I've created a straightforward model in order to predict the last column (Outcome).

#Libraries imported
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

#Dataset imported
data = pd.read_csv('diabetes.csv')
#Assign X and y
X = data.iloc[:,:-1].values
y = data.iloc[:,-1].values

#Data preprocessed
sc = StandardScaler()
X = sc.fit_transform(X)

#Dataset split between train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

model = LogisticRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

# Predicting the results for the whole dataset
y_pred2 = model.predict(data)

#Add prediction column to original dataset
data['prediction'] = y_pred2

However, I get the following error: ValueError: X has 9 features per sample; expecting 8.

My questions are:

  1. Why can't I create a new column with the predictions for my entire dataset?
  2. How can I make predictions for rows whose outcome is blank (i.e. still needs to be predicted)? Should I upload the file again? Let's say I want to predict the following:

[Image: rows to predict]

Please let me know if my questions are clear!


Solution

  • You are feeding data (with all 9 initial features) to a model that was trained with X (8 features, since Outcome has been removed to create y), hence the error.

    What you need to do is:

    1. Get predictions using X instead of data
    2. Append the predictions to your initial data set

    i.e.:

    y_pred2 = model.predict(X)
    data['prediction'] = y_pred2
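
The mismatch and the fix can be reproduced without the original file. The sketch below is only an illustration: it assumes a synthetic stand-in for diabetes.csv with 8 made-up feature columns (f0 to f7) plus Outcome, and it assumes a scikit-learn version recent enough to expose `n_features_in_` on fitted estimators.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for diabetes.csv (assumption): 8 features plus Outcome
rng = np.random.default_rng(0)
feature_names = [f"f{i}" for i in range(8)]
data = pd.DataFrame(rng.normal(size=(200, 8)), columns=feature_names)
data["Outcome"] = (data["f0"] + data["f1"] > 0).astype(int)

# Same steps as in the question
X = data.iloc[:, :-1].values  # 8 feature columns
y = data.iloc[:, -1].values   # Outcome column

sc = StandardScaler()
X = sc.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model = LogisticRegression()
model.fit(X_train, y_train)

# data still has 9 columns, while the model was fitted on 8 features
print(data.shape[1], model.n_features_in_)

# Predict from the scaled feature matrix X, not from the raw DataFrame
y_pred2 = model.predict(X)
data["prediction"] = y_pred2
```

Passing X also matters for a second reason: the model was fitted on scaled values, and data holds the raw, unscaled ones.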
    

    Keep in mind that your prediction column will then mix predictions for rows the model was fitted on (the X_train part) with predictions for rows it never saw during training (the X_test part). I am not sure what your final objective is (and it is not what the question asks), but this is a rather unusual setup from an ML point of view.
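
If you do want predictions for every row, one way to keep the two kinds of rows apart is to record which split each row fell into. This is a minimal sketch under stated assumptions: the data is synthetic (8 made-up features f0 to f7 plus Outcome), the `split` column is a hypothetical helper name, and the scaler is fitted on the full X only to mirror the code in the question.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for diabetes.csv (assumption)
rng = np.random.default_rng(0)
feature_names = [f"f{i}" for i in range(8)]
data = pd.DataFrame(rng.normal(size=(200, 8)), columns=feature_names)
data["Outcome"] = (data["f0"] + data["f1"] > 0).astype(int)

X = StandardScaler().fit_transform(data[feature_names].values)
y = data["Outcome"].values

# Split row positions instead of the arrays, so each row's split is known
idx = np.arange(len(data))
idx_train, idx_test = train_test_split(idx, test_size=0.2, random_state=0)

model = LogisticRegression()
model.fit(X[idx_train], y[idx_train])

data["prediction"] = model.predict(X)
data["split"] = "train"               # hypothetical helper column
data.loc[idx_test, "split"] = "test"  # rows the model never saw

# Accuracy per split; only the "test" figure estimates performance on new data
per_split = (data["prediction"] == data["Outcome"]).groupby(data["split"]).mean()
print(per_split)
```

Only the accuracy on the "test" rows says anything about how the model behaves on unseen data.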

    If you have a new dataset data_new whose outcome you want to predict, you do it in a similar way, always assuming that X_new has the same features as X (i.e. again without the Outcome column, as you did for X) and has been scaled with the already-fitted scaler (sc.transform, not fit_transform):

    y_new = model.predict(X_new)
    data_new['prediction'] = y_new
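
For the second question, the rows to predict do not have to come from a separate upload if they sit in the same file with an empty Outcome. The sketch below assumes that pandas reads those blank cells as NaN, and it uses synthetic data (8 made-up features f0 to f7) in place of the real file; `labeled` and `to_predict` are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (assumption): the last 20 rows have a blank Outcome
rng = np.random.default_rng(0)
feature_names = [f"f{i}" for i in range(8)]
data = pd.DataFrame(rng.normal(size=(200, 8)), columns=feature_names)
data["Outcome"] = (data["f0"] + data["f1"] > 0).astype(float)
data.loc[180:, "Outcome"] = np.nan

# Rows with a known outcome are used for fitting, the rest are predicted
labeled = data[data["Outcome"].notna()]
to_predict = data[data["Outcome"].isna()]

# Fit the scaler and the model on the labeled rows only
sc = StandardScaler()
X_labeled = sc.fit_transform(labeled[feature_names].values)
y_labeled = labeled["Outcome"].astype(int).values
model = LogisticRegression()
model.fit(X_labeled, y_labeled)

# Reuse the fitted scaler: transform, not fit_transform
X_new = sc.transform(to_predict[feature_names].values)

# Fill in predictions only for the rows whose Outcome was blank
data["prediction"] = np.nan
data.loc[to_predict.index, "prediction"] = model.predict(X_new)
```

Rows that already have an Outcome keep NaN in the prediction column, so known labels and predicted ones stay distinguishable.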