python machine-learning scikit-learn prediction

How to make machine learning predictions for empty rows?

I have a dataset that shows whether a person has diabetes based on several indicators. It looks like this:

Original dataset

I've created a straightforward model in order to predict the last column (Outcome).

#Libraries imported
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

#Dataset imported
data = pd.read_csv('diabetes.csv')
#Assign X and y
X = data.iloc[:,:-1].values
y = data.iloc[:,-1].values

#Data preprocessed
sc = StandardScaler()
X = sc.fit_transform(X)

#Dataset split between train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

model = LogisticRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

# Predicting the results for the whole dataset
y_pred2 = model.predict(data)

#Add prediction column to original dataset
data['prediction'] = y_pred2

However, I get the following error: ValueError: X has 9 features per sample; expecting 8.

My questions are:

Why can't I create a new column with the predictions for my entire dataset?
How can I make predictions for rows with a blank Outcome (the ones that need to be predicted)? Should I upload the file again? Let's say I want to predict the following:

Rows to predict:

Please let me know if my questions are clear!

Solution

You are feeding data (all 9 columns, including Outcome) to a model that was trained on X (8 features, since Outcome was removed to create y), hence the error.

What you need to do is:

Get predictions using X instead of data
Append the predictions to your initial data set

i.e.:

y_pred2 = model.predict(X)
data['prediction'] = y_pred2
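To make the mismatch concrete, here is a minimal sketch that reproduces it and then applies the fix. It assumes a small synthetic DataFrame standing in for diabetes.csv; the column names f0..f7 and the generated values are hypothetical, not taken from the original file.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for diabetes.csv (hypothetical): 8 feature columns plus Outcome
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(50, 8)), columns=[f"f{i}" for i in range(8)])
data["Outcome"] = (data["f0"] > 0).astype(int)

# Same preparation as in the question: drop the last column, then scale
X = StandardScaler().fit_transform(data.iloc[:, :-1].values)
y = data.iloc[:, -1].values
model = LogisticRegression().fit(X, y)

# data has 9 columns, X has 8; this is the source of the error
print(data.shape[1], X.shape[1])

# Passing all 9 columns raises a ValueError about the number of features
try:
    model.predict(data.values)
    mismatch_error = None
except ValueError as err:
    mismatch_error = err

# The fix: predict from X, then attach the result to the original frame
data["prediction"] = model.predict(X)
```

The number of rows in X and data is the same, so the predictions line up with the original rows when assigned as a new column.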

Keep in mind that your prediction column will then mix two kinds of rows: data already used for fitting the model (the X_train part) and data the model did not see during training (the X_test part). I am not sure what your final objective is (and that is not what the question is about), but this is a rather unusual setup from an ML point of view, because predictions on the training rows tend to look better than the model really is.
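If you do keep predictions for the whole dataset, one option is to record which rows were in the test split, so the two kinds of predictions can be told apart later. This is only a sketch under assumptions: the data is synthetic, and the split column and the row_ids helper array are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for diabetes.csv (hypothetical): 8 feature columns plus Outcome
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(100, 8)), columns=[f"f{i}" for i in range(8)])
data["Outcome"] = (data["f0"] + data["f1"] > 0).astype(int)

X = StandardScaler().fit_transform(data.iloc[:, :-1].values)
y = data.iloc[:, -1].values

# Split row positions together with X and y so we know where each row ended up
row_ids = np.arange(len(data))
X_train, X_test, y_train, y_test, id_train, id_test = train_test_split(
    X, y, row_ids, test_size=0.2, random_state=0
)

model = LogisticRegression()
model.fit(X_train, y_train)

# Predictions for every row, plus a flag telling seen rows from unseen ones
data["prediction"] = model.predict(X)
data["split"] = "train"
data.loc[data.index[id_test], "split"] = "test"

# Only the held-out rows give an honest estimate of accuracy
test_accuracy = model.score(X_test, y_test)
```

Filtering on data["split"] == "test" then gives the rows whose predictions were made on unseen data.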

If you have a new dataset data_new whose outcome you want to predict, you do it in a similar way, always assuming that X_new has the same features as X (i.e. again without the Outcome column) and has been scaled with the scaler already fitted on X (sc.transform, not a new fit_transform):

y_new = model.predict(X_new)
data_new['prediction'] = y_new
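For the case in the question, where the rows to predict sit in the same file with a blank Outcome, there is no need to upload anything again. A possible approach, sketched here on synthetic data, is to fit on the rows that have an Outcome and predict the rest. The column names, the generated values, and the choice to blank out the last 10 rows are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for diabetes.csv (hypothetical): 8 feature columns plus Outcome
rng = np.random.default_rng(0)
feature_cols = [f"f{i}" for i in range(8)]
data = pd.DataFrame(rng.normal(size=(100, 8)), columns=feature_cols)
data["Outcome"] = (data["f0"] + data["f1"] > 0).astype(float)

# Blank out the last 10 outcomes to mimic the rows that need predicting
data.loc[data.index[-10:], "Outcome"] = np.nan

# Separate rows with a known Outcome from rows where it is missing
labeled = data[data["Outcome"].notna()]
unlabeled = data[data["Outcome"].isna()]

# Fit the scaler and the model on labeled rows only
sc = StandardScaler()
X_labeled = sc.fit_transform(labeled[feature_cols].values)
y_labeled = labeled["Outcome"].astype(int).values

model = LogisticRegression()
model.fit(X_labeled, y_labeled)

# Reuse the fitted scaler (transform, not fit_transform) on the blank rows
X_new = sc.transform(unlabeled[feature_cols].values)

# Write predictions only into the rows that had no Outcome
data["prediction"] = np.nan
data.loc[unlabeled.index, "prediction"] = model.predict(X_new)
```

Rows that already had an Outcome keep NaN in the prediction column here, which keeps known labels and model output clearly separated.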