Tags: machine-learning, regression, xgboost, kaggle

Is overfitting always a bad thing?


I am currently participating in several machine learning competitions as I am trying to learn this field.

For a regression problem, I'm using XGBoost. Here is the procedure I use:

After feature engineering, I split my data into two sets, one training set and one testing set, as usual. Then I fit my XGBoost model on the training set and validate it on the testing set. Here are the results I get (I also show the public set results when I used the trained model to predict the target for submission; the metric is MAE):

Iteration  training score  testing score  public score   
100        8.05            12.14          17.50
150        7.10            11.96          17.30
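Roughly, the code looks like this (simplified; the file name "train.csv", the column name "target", and the hyperparameters here are placeholders, not the actual competition setup):

    import pandas as pd
    import xgboost as xgb
    from sklearn.metrics import mean_absolute_error

    # feature-engineered data, ordered by time; "target" is a placeholder column name
    df = pd.read_csv("train.csv")

    # time series: split chronologically instead of shuffling
    split = int(len(df) * 0.5)
    train, test = df.iloc[:split], df.iloc[split:]

    X_train, y_train = train.drop(columns=["target"]), train["target"]
    X_test, y_test = test.drop(columns=["target"]), test["target"]

    model = xgb.XGBRegressor(n_estimators=150)  # "Iteration" in the table above
    model.fit(X_train, y_train)

    print("training MAE:", mean_absolute_error(y_train, model.predict(X_train)))
    print("testing MAE:", mean_absolute_error(y_test, model.predict(X_test)))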

Remarks :

  • All data sets (train/test/public set) are roughly the same size, approximately 200 000 samples.

  • It is a time series, so I didn't shuffle the data when splitting, even though shuffling it doesn't change the results.

  • I also tried to train my XGBoost on the samples that are temporally close to the public data set, but the results aren't better.

  • When I train on all the data (train + test) before I submit, I get an improvement of 0.35 in the public score.

Here are my questions :

  • Can we really estimate over-fitting just by looking at the difference between training and testing scores? Or is it only an indicator of over-fitting?

  • Why does my submission score improve when I increase the number of iterations, even though it shows that I'm increasingly over-fitting?

  • Why is the improvement in submission score even better than the improvement in the testing score?

  • Why aren't the testing score and submission score closer? Why do I systematically have a big difference between the testing and submission scores, regardless of the hyperparameters or "over-fitting rate"?

Is this statement true: If the ratio of useful information learned over the useless information (training set specific information) is greater than 1, then you can continue over-fitting and still improve the model?

I hope it's not too confusing; I'm sorry if I don't use the proper vocabulary. I should mention that, even with the over-fitting and the big difference between testing and public scores, I am currently second on the leaderboard out of 50 participants.


Solution

  • First of all, understand what over-fitting is.

    You can see over-fitting when the training score is increasing (or its error is decreasing) while your testing score is decreasing (or its error is increasing).

    Over-fitting is when your trained model is too precise and doesn't generalize to the problem you are trying to solve. In other words, it is too FIT for the training data, and the training data alone, so it cannot solve/predict a different data set.

    In your example, it seems like both the training and testing errors are decreasing, which means you are not over-fitting.

    Over-fitting is always a bad thing.
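    One concrete way to watch for that divergence with XGBoost is to record the training and testing error per boosting round via eval_set. A sketch on toy data (recent xgboost versions accept eval_metric in the constructor; the data here only stands in for yours):

        import xgboost as xgb
        from sklearn.datasets import make_regression
        from sklearn.model_selection import train_test_split

        # toy regression data standing in for the real competition data
        X, y = make_regression(n_samples=5000, n_features=20, noise=10.0, random_state=0)
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

        model = xgb.XGBRegressor(n_estimators=500, eval_metric="mae")
        model.fit(X_train, y_train,
                  eval_set=[(X_train, y_train), (X_test, y_test)],
                  verbose=False)

        history = model.evals_result()
        train_mae = history["validation_0"]["mae"]   # per-round training error
        test_mae = history["validation_1"]["mae"]    # per-round testing error

        # over-fitting shows up where training MAE keeps dropping while testing MAE rises
        best_round = min(range(len(test_mae)), key=test_mae.__getitem__)
        print("lowest testing MAE at round", best_round, ":", test_mae[best_round])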

    As for your current problem: if you want to run multiple cross-validations, or manually split your data into many training and testing sets, you can do the following (a rough sketch in code follows the list):

    1. Split the data into training and testing sets, (50%, 50%) or (70%, 30%), whatever you think is right for you.
    2. Then randomly sample X% of your training data; that will be the training set.
    3. Randomly sample Y% of the testing data; that will be your testing set.
    4. I suggest X = Y = 75%, and the split above to be 70% training and 30% testing.
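    A rough sketch of that scheme (plain NumPy; X and y are assumed to be arrays ordered by time):

        import numpy as np

        def random_subsample_splits(X, y, n_rounds=10, train_frac=0.70, sample_frac=0.75, seed=0):
            """Split once into train/test, then repeatedly subsample 75% of each part."""
            rng = np.random.default_rng(seed)
            cut = int(len(X) * train_frac)                    # 70% / 30% chronological split
            train_idx, test_idx = np.arange(cut), np.arange(cut, len(X))
            for _ in range(n_rounds):
                tr = rng.choice(train_idx, size=int(len(train_idx) * sample_frac), replace=False)
                te = rng.choice(test_idx, size=int(len(test_idx) * sample_frac), replace=False)
                yield X[tr], y[tr], X[te], y[te]

        # each round gives a different 75% sample of the training and testing parts:
        # for X_tr, y_tr, X_te, y_te in random_subsample_splits(X, y):
        #     ...fit and score a model here...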

    As for your questions:

    1. It is just an indicator of over-fitting.
    2. Based on your example, you are not over-fitting.
    3. Scores will vary on different data sets.
    4. Same answer as 3.

    Adding a picture to describe over-fitting: [figure: training and testing error as a function of model complexity]

    There is a point in complexity (10 in the picture) where continuing to train will decrease the training error but will INCREASE the testing error.
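    A small experiment along those lines, varying complexity through max_depth on toy data; the depth at which the testing column stops improving plays the role of that point in the picture:

        import xgboost as xgb
        from sklearn.datasets import make_regression
        from sklearn.metrics import mean_absolute_error
        from sklearn.model_selection import train_test_split

        X, y = make_regression(n_samples=3000, n_features=20, noise=10.0, random_state=0)
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

        for depth in range(1, 13):  # increasing model complexity
            m = xgb.XGBRegressor(n_estimators=200, max_depth=depth).fit(X_tr, y_tr)
            train_mae = mean_absolute_error(y_tr, m.predict(X_tr))
            test_mae = mean_absolute_error(y_te, m.predict(X_te))
            print(f"max_depth={depth:2d}  training MAE={train_mae:7.2f}  testing MAE={test_mae:7.2f}")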