I am trying to learn ML techniques in Python using Spaceship Titanic.
What I am trying to do is to perform a 3-fold cross-validation and predict the target variable (Transported) using features from test.csv. The only thing that I can do is to train a model on my training set, as it contains both my features and my response. Here is what I have tried:
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict, KFold
from sklearn.neighbors import KNeighborsClassifier
X, y = train_ready.drop('Transported', axis=1), train_ready['Transported']
# 3-Fold Cross-Validation -----
cross_validation = KFold(n_splits=3, random_state=2022, shuffle=True)
classifier = KNeighborsClassifier(n_neighbors=10)
scores = cross_val_score(classifier, X, y, cv=cross_validation)
y_pred = cross_val_predict(classifier, X, y, cv=cross_validation)
y_test_predictions = cross_val_predict(classifier, test_ready, cv=cross_validation)
> TypeError: fit() missing 1 required positional argument: 'y'
And, obviously, I cannot predict my target from the test.csv dataset as it does not have this column. What is the right algorithm for this task and what am I doing wrong?
P.S. I would appreciate your patience, as I am new to ML in Python and its syntax; my previous experience was primarily in R.
Think of it like this: cross-validation is used to compare models and tune hyperparameters. Once you have settled on a model and its hyperparameters, you train that model one more time on the full training set and use it to predict on the unknown data. cross_val_predict fits a fresh model on each training fold, so it needs the labels y; calling it with only test_ready is what raises the TypeError. So when making the final predictions you shouldn't use any cross-validation function. Instead, you should do something like this:
classifier = KNeighborsClassifier(n_neighbors=10)
classifier.fit(X,y)
y_test_predictions = classifier.predict(test_ready)
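If the value of n_neighbors is not fixed yet, the "cross-validate first, then refit on everything" workflow could be sketched with GridSearchCV, which does both steps in one object. This is only a sketch under assumptions: the data is a synthetic stand-in generated with make_classification (X_demo, y_demo and test_demo play the roles of X, y and test_ready), and the candidate values for n_neighbors are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption): swap in X, y and test_ready.
X_all, y_all = make_classification(n_samples=300, n_features=8, random_state=2022)
X_demo, y_demo = X_all[:240], y_all[:240]  # labelled rows, like train.csv
test_demo = X_all[240:]                    # unlabelled rows, like test.csv

# Same 3-fold splitter as in the question.
cv = KFold(n_splits=3, shuffle=True, random_state=2022)

# Candidate values are arbitrary and only for illustration.
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [3, 5, 10, 15]},
    cv=cv,
    scoring="accuracy",
)

# Cross-validates every candidate, then (refit=True is the default)
# retrains the best one on all of X_demo, y_demo.
search.fit(X_demo, y_demo)

best_k = search.best_params_["n_neighbors"]
print("best n_neighbors:", best_k)
print("mean CV accuracy:", search.best_score_)

# Predictions for the unlabelled rows come from the refitted model.
test_predictions = search.predict(test_demo)
```

Note that cross-validation only ever touches the labelled data here; the unlabelled rows are passed to predict and nothing else.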
You could of course hold out some of the training data as a sanity check that the model doesn't overfit before making final predictions on the unknown dataset, although the cross-validation scores should already have convinced you that this is not the case.
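A hold-out check like that could look roughly as follows. Again this is a sketch, not a recipe: the data is a synthetic stand-in from make_classification rather than the Spaceship Titanic frames, and the 80/20 split and the names check_model and final_model are my own choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the labelled training data (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=2022)

# Keep 20% of the labelled rows aside; the model never sees them while fitting.
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=2022, stratify=y_demo
)

check_model = KNeighborsClassifier(n_neighbors=10)
check_model.fit(X_fit, y_fit)

# A large gap between these two numbers would hint at overfitting.
fit_accuracy = check_model.score(X_fit, y_fit)
holdout_accuracy = check_model.score(X_hold, y_hold)
print(f"fit accuracy: {fit_accuracy:.3f}, hold-out accuracy: {holdout_accuracy:.3f}")

# Once satisfied, retrain on all labelled rows before predicting test.csv.
final_model = KNeighborsClassifier(n_neighbors=10).fit(X_demo, y_demo)
```

The model used for the final predictions is the one refitted on all labelled rows, not check_model, so no training data is wasted.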