python pandas machine-learning scikit-learn logistic-regression

How to format new input for a Logistic Regression prediction in Python

I have the dataset below.

I've built a logistic regression model from it and checked the accuracy, which looks fine. Now I have a new data point with Age 30 and EstimatedSalary 50000, and I'd like to predict whether Purchased will be 0 or 1. How do I pass the new values 30 and 50000 to the model in my Python code?

Below is the Python code I've used.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
%matplotlib inline

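# Load the data
dataset = pd.read_csv(r"suv_data.csv")

# Features: first two columns; label: third column
X = dataset.iloc[:, [0, 1]].values
y = dataset.iloc[:, 2].values

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Fit the scaler on the training data only, then apply it to the test data
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

classifier = LogisticRegression(random_state=0)
classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_test)

# Accuracy as a percentage
accuracy_score(y_test, y_pred) * 100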

Regards,

Bharath Vikas

Solution

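In general, to get predictions from a trained model (i.e. call .predict in sklearn), you need to pass samples with the same number of features, in the same order, as the samples the model was trained on. They also need to go through the same preprocessing (here, the fitted StandardScaler).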
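To make the shape requirement concrete, here is a minimal sketch. It assumes synthetic two-feature data (the values and the name model are made up for illustration, not taken from your dataset). The point is that even a single sample has to be a 2-D array with one row and as many columns as the training data had.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the training data (hypothetical values):
# 100 samples, 2 features each.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 2))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)

model = LogisticRegression().fit(X_train, y_train)

# A single sample must still be 2-D: one row, two columns.
sample = np.array([0.5, -0.2]).reshape(1, -1)
print(sample.shape)

# predict returns one label per input row.
prediction = model.predict(sample)
print(prediction.shape)
```

Passing the flat 1-D array without the reshape would be rejected by sklearn, since it cannot tell whether you mean one sample with two features or two samples with one feature.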

In your case I suppose (see my comment on your question) you wanted to train on samples with Age and EstimatedSalary as features, using Purchased as the label.

Then, to predict for a single sample, try this:

single_test_sample = pd.DataFrame({'Age':[30], 'EstimatedSalary':[50000]}).iloc[:,[0,1]].values
single_test_sample = sc.transform(single_test_sample)
single_test_prediction = classifier.predict(single_test_sample)

Note that you can also add more values to the Age and EstimatedSalary columns of the test dataframe; here I only added the sample you were interested in. If you add more, the model will output one prediction for each row in the test dataframe.
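As a rough illustration of predicting several rows at once, here is a self-contained sketch. Since I don't have your suv_data.csv, it assumes randomly generated Age and EstimatedSalary values and a made-up rule for the label, so the actual predictions mean nothing; only the shapes and the workflow carry over. It also calls predict_proba, in case you want the class probabilities rather than just 0 or 1.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real dataset (hypothetical values and label rule).
rng = np.random.default_rng(1)
age = rng.integers(18, 60, size=200)
salary = rng.integers(15000, 150000, size=200)
X = np.column_stack([age, salary]).astype(float)
y = ((age > 40) | (salary > 90000)).astype(int)

# Fit the scaler and the classifier on the training data.
sc = StandardScaler()
classifier = LogisticRegression(random_state=0)
classifier.fit(sc.fit_transform(X), y)

# Three new samples, one per row: [Age, EstimatedSalary].
new_samples = np.array([[30, 50000], [45, 120000], [22, 20000]], dtype=float)

# Reuse the already fitted scaler (transform, not fit_transform).
new_scaled = sc.transform(new_samples)

predictions = classifier.predict(new_scaled)          # one label per row
probabilities = classifier.predict_proba(new_scaled)  # one row of class probabilities per sample
print(predictions)
print(probabilities)
```

The important detail is that the new rows go through sc.transform with the scaler that was fitted on the training data; fitting a fresh scaler on the new rows would put them on a different scale from what the model saw.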

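Also note that your code and mine will work without the .values at the end of the train/test sets, as sklearn accepts pandas DataFrames directly. If you go that way, keep it consistent: the column names you pass at prediction time should match the ones used when fitting.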
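Here is a small sketch of the DataFrame-only variant, again on made-up data (the column values and the label rule are assumptions for illustration). The same column list is used for fitting and for prediction, so the feature names line up.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real dataset (hypothetical values and label rule).
rng = np.random.default_rng(2)
train_df = pd.DataFrame({
    "Age": rng.integers(18, 60, size=200),
    "EstimatedSalary": rng.integers(15000, 150000, size=200),
})
train_df["Purchased"] = (
    (train_df["Age"] > 40) | (train_df["EstimatedSalary"] > 90000)
).astype(int)

features = ["Age", "EstimatedSalary"]

# Fit directly on DataFrame columns, no .values needed.
sc = StandardScaler()
X_scaled = sc.fit_transform(train_df[features])
classifier = LogisticRegression(random_state=0)
classifier.fit(X_scaled, train_df["Purchased"])

# The new sample uses the same column names, in the same order.
new_df = pd.DataFrame({"Age": [30], "EstimatedSalary": [50000]})
single_prediction = classifier.predict(sc.transform(new_df[features]))
print(single_prediction)
```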