
Train data and test data that have target column


I'm trying to build a predictive model using the Banking Dataset - Marketing Targets from Kaggle. Here is the link: https://www.kaggle.com/datasets/prakharrathi25/banking-dataset-marketing-targets

The Kaggle dataset is already separated into a training CSV and a test CSV, but both files have a target column y. Should I concatenate them into one data frame before starting EDA and preprocessing, and then use train_test_split from the sklearn library when creating the model?

My second question: I've also seen Kaggle datasets like this one: https://www.kaggle.com/datasets/anmolkumar/health-insurance-cross-sell-prediction, which is also already separated into a training CSV and a test CSV. The difference is that the test CSV doesn't have the target column (Response). That makes me think I can't concatenate the two datasets.

Can someone please explain this to me?


Solution

  • If you already have a dataset divided into a training set and a test set, there is no need to concatenate it and split it again. You can use the training set directly for training and the test set for testing the model. In supervised learning, the target column is needed in both sets:

    • in training, you use the target column to "teach" the model how to behave, and how to adjust its internal weights to reflect the expected output.
    • in testing, you check how the trained model performs on unseen data. You feed each row of the test set to the model (without the target column) and check whether the prediction matches the actual target value.

    If you preprocess the data (through feature selection, feature encoding, or other transformations), make sure to run the same pipeline on both the training and test sets, with any learned parameters (such as encoder categories or scaling statistics) fitted on the training set only.
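As a rough illustration of that workflow, here is a minimal sketch using scikit-learn. The two small data frames are made-up stand-ins for the training and test CSV files, and the feature columns (`age`, `job`) and the choice of `LogisticRegression` are assumptions for the example, not something prescribed by the dataset. With the real files you would load each one with `pd.read_csv` instead.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up stand-ins for the two CSV files (hypothetical columns and values).
train_df = pd.DataFrame({
    "age": [25, 40, 33, 58, 47, 29, 51, 36],
    "job": ["admin", "services", "admin", "retired",
            "services", "student", "retired", "admin"],
    "y": ["no", "no", "no", "yes", "no", "yes", "yes", "no"],
})
test_df = pd.DataFrame({
    "age": [30, 60, 45],
    "job": ["admin", "retired", "unknown"],
    "y": ["no", "yes", "no"],
})

# Separate the features from the target column in both sets.
X_train, y_train = train_df.drop(columns="y"), train_df["y"]
X_test, y_test = test_df.drop(columns="y"), test_df["y"]

# One pipeline holds the preprocessing and the model, so the test set goes
# through exactly the same steps as the training set.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age"]),
    # handle_unknown="ignore" tolerates categories that only occur in the test set.
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["job"]),
])
model = Pipeline([("preprocess", preprocess), ("clf", LogisticRegression())])

# Scaler statistics and encoder categories are learned from the training set only.
model.fit(X_train, y_train)

# The test target is never shown to the model; it is only compared with the predictions.
test_pred = model.predict(X_test)
test_accuracy = accuracy_score(y_test, test_pred)
print(f"Test accuracy: {test_accuracy:.2f}")
```

Wrapping the preprocessing and the estimator in a single `Pipeline` is one way to guarantee that `fit` only ever sees the training data while `predict` reuses the fitted transformations on the test data.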

    For the second question, is your dataset used for a competition? In that case, the test set does not include the target labels, so that participants cannot use it in the model learning phase (that would be like cheating). The test set is usually used to assess model performance. Here I suppose you have to use it just to make predictions and show the results of your model, without evaluating it directly.
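For the unlabeled case, a minimal sketch could look like the following. The data frames are again made-up stand-ins for the CSV files, and the feature columns (`Age`, `Previously_Insured`) and the model are assumptions for illustration. Because the test file has no `Response` column, the sketch holds out part of the training data as a validation set to get a performance estimate, which is one common option rather than a requirement, and uses the test file only to produce predictions.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Made-up stand-ins for the CSV files (hypothetical columns and values).
train_df = pd.DataFrame({
    "id": range(1, 13),
    "Age": [22, 45, 31, 52, 38, 27, 60, 41, 35, 48, 25, 56],
    "Previously_Insured": [1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0],
    "Response": [0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1],
})
# The test file has the same features but no Response column.
test_df = pd.DataFrame({
    "id": [101, 102, 103],
    "Age": [29, 50, 44],
    "Previously_Insured": [1, 0, 0],
})

features = ["Age", "Previously_Insured"]
X, y = train_df[features], train_df["Response"]

# Hold out a labeled validation set, since the test set cannot be scored locally.
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000)
model.fit(X_tr, y_tr)
val_accuracy = model.score(X_val, y_val)
print(f"Validation accuracy: {val_accuracy:.2f}")

# Use the unlabeled test set only to generate predictions.
submission = pd.DataFrame({
    "id": test_df["id"],
    "Response": model.predict(test_df[features]),
})
print(submission)
```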