I have been trying, without success, to construct a dataframe with a column that holds the values predicted by a model.
For the sake of a simple example I will use the iris dataset:
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
iris = load_iris()
df = pd.DataFrame(np.concatenate((iris.data, np.array([iris.target]).T), axis=1), columns=iris.feature_names + ['target'])
df.head()
This outputs the first five rows of the dataframe.
For the next step, building a model, I have:
# Get the x and y for the experiment
X = df.drop('target', axis=1).values  # pass axis by keyword; the positional form was removed in pandas 2.0
y = df["target"].values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
# Create an XGBClassifier instance, fit it, and predict on the test set
from xgboost import XGBClassifier
clf = XGBClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
At this point I am stuck. I have looked at several posts on SO about retrieving the index/id of individual data points (each row being a data point), but none of them worked for me.
Is there any way to match the predictions to each row? Or, as an alternative, to test individual rows so I can see what was predicted for them?
A simple way to do so is to keep your X and y as dataframes (i.e. remove .values):
X = df.drop('target', axis=1)  # pass axis by keyword; the positional form was removed in pandas 2.0
y = df["target"]
# rest of your code as is
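To see why this works, here is a minimal sketch (assuming a reasonably recent pandas and scikit-learn) that checks that train_test_split keeps the original row labels when it is given dataframes. The dataframe is built slightly differently from the question for brevity, and the names same_index and rows_match are only illustrative.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["target"] = iris.target

X = df.drop(columns="target")  # still a DataFrame, index intact
y = df["target"]               # still a Series, same index

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)

# The test rows carry the row labels they had in df
print(X_test.index[:5].tolist())

# X and y are split consistently, so their indexes line up
same_index = X_test.index.equals(y_test.index)

# Looking the labels up in df gives back exactly the test rows
rows_match = X_test.equals(df.loc[X_test.index, iris.feature_names])
print(same_index, rows_match)
```

Because the labels survive the split, anything you attach to X_test by position (such as the predictions) can be traced back to a row of df.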
So, after running the rest of your code, i.e. fitting the model and getting the predictions y_pred, you can add the target and prediction columns back to your X_test (which is now a dataframe):
X_test = X_test.assign(target = y_test.values)
X_test = X_test.assign(prediction = y_pred)
print(X_test.head())
# result:
     sepal length (cm)  sepal width (cm)  ...  target  prediction
14                 5.8               4.0  ...     0.0         0.0
98                 5.1               2.5  ...     1.0         1.0
75                 6.6               3.0  ...     1.0         1.0
16                 5.4               3.9  ...     0.0         0.0
131                7.9               3.8  ...     2.0         2.0

[5 rows x 6 columns]
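If you would rather keep X and y as numpy arrays, one possible alternative is to pass an array of row labels through train_test_split along with them, since it splits every array it receives in the same way. The sketch below is not part of the approach above. It assumes scikit-learn's DecisionTreeClassifier as a stand-in for XGBClassifier so that it runs without xgboost installed; row_ids, ids_train, ids_test, row_label and single_pred are illustrative names. It also shows one way to predict a single row, which covers the second part of the question.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier  # stand-in for XGBClassifier

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["target"] = iris.target

X = df.drop(columns="target").values
y = df["target"].values
row_ids = df.index.to_numpy()  # labels that identify each row of df

# All three arrays are shuffled and split consistently
X_train, X_test, y_train, y_test, ids_train, ids_test = train_test_split(
    X, y, row_ids, test_size=0.25, random_state=1
)

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)

# Write predictions back onto df; training rows stay NaN
df["prediction"] = np.nan
df.loc[ids_test, "prediction"] = clf.predict(X_test)

# Predict one specific row: predict() expects a 2D input,
# so select the row with a list of labels to keep two dimensions
row_label = ids_test[0]
single_pred = clf.predict(df.loc[[row_label], iris.feature_names].values)[0]
print(row_label, single_pred)
```

With this variant the predictions end up in the original df rather than in X_test, which can be convenient when you want to compare test rows against the rest of the data.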