Tags: python, pandas, machine-learning, scikit-learn, kaggle

How to use the test data against the trained model?


I'm a beginner in Machine Learning and I'm working through the Titanic competition. At first, my model gave me an accuracy of 1.0, which was too good to be true. Then I realized that I was comparing my trained model against the same training data I had used to train it, and that my test data was nowhere to be found. This is why I think it gave me such an absurd number.

The following is my code:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import accuracy_score

train_path = "C:\\Users\\Omar\\Downloads\\Titanic Data\\train.csv"
test_path = "C:\\Users\\Omar\\Downloads\\Titanic Data\\test.csv"

train_data = pd.read_csv(train_path)
test_data = pd.read_csv(test_path)

train_data['Sex'] = pd.factorize(train_data.Sex)[0]

columns_of_interest = ['Survived','Pclass', 'Sex', 'Age']
filtered_titanic_data = train_data.dropna(axis=0)

x = filtered_titanic_data[columns_of_interest]
y = filtered_titanic_data.Survived

train_x, val_x, train_y, val_y = train_test_split(x, y, random_state=0)

titanic_model = DecisionTreeRegressor()
titanic_model.fit(train_x, train_y)

val_predictions = titanic_model.predict(val_x)

print(val_predictions)
print(accuracy_score(val_y, val_predictions))

I know that val_predictions needs to have something to do with my test data, but I'm not sure how to implement that.


Solution

  • train_test_split() is intended to take your dataset and split it into two chunks, a training set and a testing set. In your case, the data is already split into two chunks, in separate CSV files. You are then taking the training data and splitting it again into train and val, which is short for validation (essentially a hold-out set used to verify the model).

    You probably want to call model.fit on your full training data set, and then call model.predict against the test set. There shouldn't be a need to call train_test_split() at all.
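    For illustration, here is a minimal sketch of what train_test_split() does on its own, using made-up placeholder rows rather than the real Titanic data; by default it holds out 25% of the rows as the second chunk.

    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Made-up placeholder rows standing in for a training table (not real Titanic data).
    df = pd.DataFrame({'Pclass':   [1, 2, 3, 1, 2, 3, 1, 2],
                       'Age':      [22, 38, 26, 35, 28, 19, 54, 40],
                       'Survived': [1, 1, 0, 1, 0, 0, 1, 0]})

    x = df[['Pclass', 'Age']]
    y = df.Survived

    # One call returns two chunks of features and two chunks of labels.
    # test_size=0.25 is the default; random_state fixes the shuffle so runs repeat.
    train_x, val_x, train_y, val_y = train_test_split(x, y, test_size=0.25, random_state=0)

    print(train_x.shape, val_x.shape)  # (6, 2) (2, 2)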


    Edit:

    I may be wrong here. Looking at the competition page, I realize that the test set does not include the ground truth values, so you can't use that data to validate your model's accuracy. In that case, splitting the original training dataset into training and validation portions does make sense. Since you're fitting the model only on the train portion, the validation set is still unseen by the model, and you can then use its known values to verify your model's predictions.

    The test set would just be used to generate 'new' predictions, since you don't have the ground truth values to verify them against (a sketch of that step follows).
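    For completeness, here is a minimal sketch of what you would typically do with those test-set predictions in this competition: pair them with the passenger IDs and write a submission CSV. The write_submission helper is my own illustration, not part of the original code; it assumes a fitted model and a cleaned-up test frame like the ones built in the code further down, and the PassengerId/Survived column names that the Titanic competition expects.

    import pandas as pd

    # Hypothetical helper: turn test-set predictions into a Kaggle-style submission file.
    def write_submission(model, test_frame, feature_columns, out_path="submission.csv"):
        predictions = model.predict(test_frame[feature_columns])
        submission = pd.DataFrame({
            'PassengerId': test_frame.PassengerId,   # carried over from test.csv
            'Survived': predictions.astype(int),     # the competition expects 0/1 integers
        })
        submission.to_csv(out_path, index=False)
        return submission

    # e.g. write_submission(titanic_model, test_data, ['Pclass', 'Sex', 'Age'])
    # using the fitted model and prepared test_data from the code below.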


    Edit (in response to comment):

    I don't have these data sets and haven't actually run this code, but I'd suggest something like the following. Essentially you want to apply the same preparation to your test data as to your training data, and then feed it into your model the same way the validation set was fed in.

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.metrics import accuracy_score
    
    def get_dataset(path):
        data = pd.read_csv(path)
    
        data['Sex'] = pd.factorize(data.Sex)[0]      # encode male/female as integer codes

        filtered_titanic_data = data.dropna(axis=0)  # drop rows with any missing values
    
        return filtered_titanic_data
    
    train_path = "C:\\Users\\Omar\\Downloads\\Titanic Data\\train.csv"
    test_path = "C:\\Users\\Omar\\Downloads\\Titanic Data\\test.csv"
    
    train_data = get_dataset(train_path)
    test_data = get_dataset(test_path)
    
    columns_of_interest = ['Pclass', 'Sex', 'Age']
    
    x = train_data[columns_of_interest]
    y = train_data.Survived
    
    train_x, val_x, train_y, val_y = train_test_split(x, y, random_state=0)
    
    titanic_model = DecisionTreeRegressor()
    titanic_model.fit(train_x, train_y)
    
    val_predictions = titanic_model.predict(val_x)
    
    print(val_predictions)
    print(accuracy_score(val_y, val_predictions))
    
    test_x = test_data[columns_of_interest]
    test_predictions = titanic_model.predict(test_x)   # predictions for the unlabeled test set
    

    (Also, note that I removed the Survived column from columns_of_interest. I believe that by including that column in your x data, you were giving the model the very value it was trying to predict, which is likely why you were getting 1.0 on the validation set as well: you were giving it the answers to the test.)
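    To see why including the target among the features produces a perfect score even on held-out rows, here is a small self-contained sketch with made-up random data (not the Titanic files): the tree simply learns to split on the leaked column.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.metrics import accuracy_score

    rng = np.random.RandomState(0)
    n = 400

    # Two junk features with no signal, and a label that is just a coin flip.
    features = rng.rand(n, 2)
    label = rng.randint(0, 2, n)

    # Leaky feature matrix: the label itself is appended as a third column,
    # mirroring 'Survived' being left inside columns_of_interest.
    x_leaky = np.column_stack([features, label])

    train_x, val_x, train_y, val_y = train_test_split(x_leaky, label, random_state=0)

    model = DecisionTreeRegressor().fit(train_x, train_y)

    # The tree splits on the leaked column, so even held-out rows are predicted
    # perfectly and the accuracy comes out as 1.0.
    print(accuracy_score(val_y, model.predict(val_x)))

    # Drop the leaked column and the same model is no better than chance (~0.5).
    model_clean = DecisionTreeRegressor().fit(train_x[:, :2], train_y)
    print(accuracy_score(val_y, model_clean.predict(val_x[:, :2])))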