
Train Test Valid data sets... General question about fitting the models


So I was given Xtrain, ytrain, Xtest, ytest, Xvalid, yvalid data for a HW assignment. This assignment is for a Random Forest but I think my question can apply to any/most models.

My understanding is that you use Xtrain and ytrain to fit the model, e.g. clf.fit(Xtrain, ytrain), and this creates the model, which can then give you a score and predictions for your training data.

So when I move on to the test and validation sets, I would only use ytest and yvalid to see how the model predicts and scores. My professor provided us with three X datasets (Xtrain, Xtest, Xvalid), but to me it seems I only need Xtrain to train the model initially and can then test the model on the different y datasets.

If I called .fit() on each (X, y) pair, I would fit three different models on completely different data, so from my perspective the models would not be comparable.

Am I wrong?


Solution

  • Training step:

    Assuming you are using sklearn, clf.fit(Xtrain, ytrain) trains your model (clf) to fit the training features Xtrain and their labels ytrain. At this stage you can also compute a score on the training data, as you said.

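    # Training step: fit the model once, on the training data only
    from sklearn.ensemble import RandomForestClassifier

    clf = RandomForestClassifier()  # or any other sklearn classifier
    clf.fit(Xtrain, ytrain)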
    

  • Test step:

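    Next, pass the test features Xtest to the previously trained model to generate predicted labels ypred. The model needs Xtest here: predictions are always computed from features, and ytest is only used afterwards for scoring.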

    #test step
    ypred = clf.predict(Xtest)
    

    Finally, compare the predicted labels ypred with the true labels ytest to evaluate how the model performs on unseen data (data not used during training), using tools such as a confusion matrix, a classification report, or accuracy.

    from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

    # Compare the true test labels with the predicted ones
    test_cm = confusion_matrix(ytest, ypred)
    test_report = classification_report(ytest, ypred)
    test_accuracy = accuracy_score(ytest, ypred)
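
The steps above can be put together in one small end-to-end sketch that also covers the validation set from the question. This is only an illustration under stated assumptions: the data is synthetic (generated with make_classification as a stand-in for the assignment files), and the split sizes and the n_estimators and random_state values are arbitrary choices, not anything from the assignment. The point it shows is that the model is fitted once on (Xtrain, ytrain), and every evaluation needs both the X and the y of that split.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the assignment data (an assumption, not the real data).
X, y = make_classification(n_samples=600, n_features=10, random_state=0)

# Carve out 60% train, 20% validation, 20% test.
Xtrain, Xrest, ytrain, yrest = train_test_split(X, y, test_size=0.4, random_state=0)
Xvalid, Xtest, yvalid, ytest = train_test_split(Xrest, yrest, test_size=0.5, random_state=0)

# Fit exactly once, on the training pair only.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(Xtrain, ytrain)

# predict() takes features (X), never labels; the y arrays are only used for scoring.
scores = {}
splits = [("train", Xtrain, ytrain), ("valid", Xvalid, yvalid), ("test", Xtest, ytest)]
for name, X_part, y_part in splits:
    y_pred = clf.predict(X_part)
    scores[name] = accuracy_score(y_part, y_pred)

print(scores)
```

Because the same fitted clf is used for all three splits, the three scores are directly comparable. A common convention, though your course may do it differently, is to use the validation score to choose hyperparameters and to keep the test score for one final check.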