I am training a model to predict true or false based on some data. I drop the product number from the list of features when training and testing the model.
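from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Drop the identifier and the target column so neither is used as a feature
X = df.drop(columns=['Product Number', 'result'])
y = df['result']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
svc = LinearSVC(max_iter=1200)  # lowercase name avoids shadowing sklearn.svm.SVC
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)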
Is there any way to recover the product number, along with its features, for each item that passed or failed? In other words, how do I relate each value in y_pred to the product number it corresponds to?
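One possible approach, sketched here on made-up data: train_test_split keeps the DataFrame's row labels, so X_test.index can be used to look the test rows up in the original frame. The toy df, its feature_a/feature_b columns, and the names clf and results are assumptions for illustration only; the sketch also drops 'result' from the features so the target is not used as an input.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Toy stand-in for df: column names follow the question, values are invented.
df = pd.DataFrame({
    "Product Number": [f"P{i:03d}" for i in range(20)],
    "feature_a": [float(i) for i in range(20)],
    "feature_b": [float(i % 5) for i in range(20)],
    "result": [i >= 10 for i in range(20)],
})

X = df.drop(columns=["Product Number", "result"])
y = df["result"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

clf = LinearSVC(max_iter=1200)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# X_test keeps the row labels of df, so those labels select the matching
# rows (including 'Product Number') from the original frame.
results = df.loc[X_test.index].copy()
# y_pred is in the same row order as X_test, so it lines up with results.
results["predicted"] = y_pred
print(results[["Product Number", "result", "predicted"]])
```

Each row of results then holds the product number, its features, the true label, and the prediction side by side.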
I also plan on using cross validation, so the data will get shuffled. Would there still be a way to recover the product number for each test item?
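For the cross validation case, one option that may fit is scikit-learn's cross_val_predict, which returns one out-of-fold prediction per row in the same order as the input, even when the folds are shuffled. This is only a sketch on invented data: the toy df, its feature columns, the 5-fold KFold setup, and the results name are assumptions, not part of my real code.

```python
import pandas as pd
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.svm import LinearSVC

# Toy stand-in for df: column names follow the question, values are invented.
df = pd.DataFrame({
    "Product Number": [f"P{i:03d}" for i in range(20)],
    "feature_a": [float(i) for i in range(20)],
    "feature_b": [float(i % 5) for i in range(20)],
    "result": [i >= 10 for i in range(20)],
})

X = df.drop(columns=["Product Number", "result"])
y = df["result"]

# Shuffled folds: each row lands in exactly one test fold.
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# The returned array is ordered like the rows of X, not like the folds,
# so it can be attached to df directly.
y_pred = cross_val_predict(LinearSVC(max_iter=1200), X, y, cv=cv)

results = df.assign(predicted=y_pred)
print(results[["Product Number", "result", "predicted"]].head())
```

Because the predictions come back in input order, no extra bookkeeping of fold indices is needed to match them to product numbers.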
I realised I'm using cross validation only to evaluate my model's performance, so I decided to just run my code without shuffling the data to see the results for each datapoint.
Edit: For evaluation without cross validation, I keep the irrelevant columns in the data and drop them only when I pass the data to the classifier, as shown below:
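from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

cols = ['id', 'label']  # kept for bookkeeping, never passed to the model
X = train_data.copy()
y = train_data['label']
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=2)
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=10))
knn.fit(X_train.drop(columns=cols), y_train)
y_val_pred = knn.predict(X_val.drop(columns=cols))
# assign() returns a new frame, avoiding a possible SettingWithCopyWarning
X_val = X_val.assign(y_val_pred=y_val_pred)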
I join y_val_pred to X_val after prediction to check which datapoints have been misclassified.
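Once the predictions sit next to the true labels, the misclassified rows can be picked out with a boolean mask. A minimal sketch, assuming a validation frame shaped like the one above; the four rows and the misclassified name are invented for illustration:

```python
import pandas as pd

# Hypothetical validation frame with the prediction column already joined.
X_val = pd.DataFrame({
    "id": [11, 12, 13, 14],
    "label": [1, 0, 1, 0],
    "y_val_pred": [1, 1, 1, 0],
})

# Keep only the rows where the prediction disagrees with the true label.
misclassified = X_val[X_val["label"] != X_val["y_val_pred"]]
print(misclassified["id"].tolist())  # prints [12]
```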