I have been trying to use XGBRegressor in Python. It is by far one of the best ML techniques I have used. However, on some data sets I get a very high training R-squared, but the model performs really poorly in prediction or testing. I have tried playing with gamma, depth, and subsampling to reduce the complexity of the model and make sure it is not overfitted, but there is still a huge difference between training and testing. I was wondering if someone could help me with this.
Below is the code I am using:
import xgboost
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hold out 30% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=100)

# Fit a min-max scaler on the training split only
scaler = MinMaxScaler()
scaler.fit(X_train)

xgb = xgboost.XGBRegressor(
    colsample_bytree=0.7,
    gamma=0,
    learning_rate=0.01,
    max_depth=1,
    min_child_weight=1.5,
    n_estimators=100000,
    reg_alpha=0.75,
    reg_lambda=0.45,
    subsample=0.8,
    random_state=1000,  # random_state replaces the older seed argument
)
Here is the performance in training vs testing:
Training: MAE: 0.10 R^2: 0.99
Testing: MAE: 1.47 R^2: -0.89
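For reference, a minimal sketch of how these two metrics are usually defined, which also shows why R^2 can be negative: it drops below zero whenever the predictions are worse than always predicting the mean of the true values. The helper names `mae` and `r_squared` and the sample numbers are made up for illustration and are not the data from the question.

```python
def mae(y_true, y_pred):
    """Mean absolute error: average absolute gap between target and prediction."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, with SS_tot measured around the mean of y_true."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up numbers, only to illustrate how R^2 can go below zero.
y_true = [1.0, 2.0, 3.0, 4.0]
good_pred = [1.1, 1.9, 3.2, 3.8]   # close to the targets
bad_pred = [4.0, 1.0, 5.0, 0.0]    # worse than predicting the mean (2.5)

print(mae(y_true, good_pred), r_squared(y_true, good_pred))
print(mae(y_true, bad_pred), r_squared(y_true, bad_pred))
```

A test R^2 of -0.89 therefore means the fitted model predicts the held-out data worse than a constant equal to the test-set mean would.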
The issue here is overfitting. You need to tune some of the parameters (Source):
- n_estimators: 80-200 if the data set is large (on the order of a lakh, i.e. 100,000), 800-1200 if it is small to medium
- learning_rate: between 0.01 and 0.1
- subsample: between 0.8 and 1
- colsample_bytree: the fraction of columns sampled for each tree. Use values from 0.3 to 0.8 if you have many features or columns, or 0.8 to 1 if you have only a few.
- gamma: 0, 1, or 5
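One way to work through the ranges listed above is to enumerate candidate settings and compare them on held-out data. The sketch below only builds the candidates, using the standard library. The specific points picked inside each range and the helper name `iter_candidates` are assumptions for illustration.

```python
from itertools import product

# Candidate values taken from the ranges listed above; the exact points
# chosen inside each range are an assumption, not a recommendation.
param_grid = {
    "n_estimators": [800, 1000, 1200],
    "learning_rate": [0.01, 0.05, 0.1],
    "subsample": [0.8, 0.9, 1.0],
    "colsample_bytree": [0.8, 0.9, 1.0],
    "gamma": [0, 1, 5],
}


def iter_candidates(grid):
    """Yield one dict per combination of the values in `grid`."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


candidates = list(iter_candidates(param_grid))
print(len(candidates))  # 3 ** 5 = 243 combinations
print(candidates[0])
```

Each dict could then be unpacked into the model, for example `xgboost.XGBRegressor(max_depth=1, **params)`, and the candidates ranked by cross-validated score rather than training score. scikit-learn's `GridSearchCV` automates the same loop if you prefer not to write it by hand.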
Since you have already set max_depth very low, try tuning the parameters above. Also, if your data set is very small, a gap between training and test performance is to be expected. Check whether the data is split well between training and test. For example, check whether the test data has a roughly equal percentage of Yes and No in the output column.
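The split check described above can be sketched as follows. The labels are hypothetical stand-ins for the output column of each split, and `class_shares` is a helper name made up for this example.

```python
from collections import Counter


def class_shares(labels):
    """Return the fraction of samples that fall in each class."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


# Hypothetical labels, standing in for the output column of each split.
y_train_labels = ["Yes"] * 28 + ["No"] * 21
y_test_labels = ["Yes"] * 18 + ["No"] * 3

print(class_shares(y_train_labels))
print(class_shares(y_test_labels))
```

If the shares differ a lot between the two splits, as they do in this made-up case, the test score is not a fair measure of the model. For a classification target, passing `stratify=y` to `train_test_split` keeps the shares equal. For a continuous target like the one in the question, comparing the mean and spread of `y_train` and `y_test` serves the same purpose.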
You need to try various options. xgboost and random forest will certainly give an overfitted model on a small data set. You can try:
1. Naive Bayes. It is good for small data sets, but it gives the same weight to every feature.
2. Logistic Regression. Try tuning the regularisation parameter and see where your recall score is highest. Another setting to try is class_weight = balanced.
3. Logistic Regression with Cross Validation. This is good for small data sets as well.
As I mentioned earlier, also check that your data is not biased towards one kind of result. For example, if the result is Yes in 50 cases out of 70, the data is highly biased and you may not get high accuracy.