python machine-learning neural-network regression train-test-split

Why does my model work well with test data from train_test_split but not with new data?


I am new to machine learning.

I have a continuous dataset. I am trying to model the target label using several features. I utilize the train_test_split function to separate the train and the test data. I am training and testing the model using the code below:

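from sklearn.model_selection import train_test_split
from keras.models import Sequential
from keras.layers import Dense

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = Sequential()
model.add(Dense(128, input_dim=X.shape[1], kernel_initializer='normal', activation='relu'))
model.add(Dense(1, kernel_initializer='normal'))
# The compile step was not shown in the question; loss and optimizer here are assumptions
model.compile(loss='mean_squared_error', optimizer='adam')
hist = model.fit(X_train.values, y_train.values, validation_data=(X_test.values, y_test.values), epochs=200, batch_size=64, verbose=1)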

I can get good results when I use X_test and y_test for validation data:

https://drive.google.com/open?id=0B-9aw4q1sDcgNWt5TDhBNVZjWmc

However, when I use this model to predict on other data (X_real, y_real), which are not very different from X_test and y_test except that they were not randomly chosen by train_test_split, I get bad results:

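X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = Sequential()
model.add(Dense(128, input_dim=X.shape[1], kernel_initializer='normal', activation='relu'))
model.add(Dense(1, kernel_initializer='normal'))
# The compile step was not shown in the question; loss and optimizer here are assumptions
model.compile(loss='mean_squared_error', optimizer='adam')
# Same model and training data; only validation_data changes to (X_real, y_real)
hist = model.fit(X_train.values, y_train.values, validation_data=(X_real.values, y_real.values), epochs=200, batch_size=64, verbose=1)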

https://drive.google.com/open?id=0B-9aw4q1sDcgYWFZRU9EYzVKRFk

Is this an overfitting issue? If so, why does my model work well with the X_test and y_test generated by train_test_split?


Solution

  • It seems that your "real data" differs from your train and test data. Why do you have separate "real" and "training" data in the first place?

    My approach would be:

    1: Mix up all the data you have.

    2: Divide your data randomly into 3 sets (train, test and validate).

    3: Use train and test as you do now and optimize your model.

    4: When it is good enough, validate the model with your validation set to make sure no overfitting occurs.
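
Steps 1 and 2 above could be sketched roughly as follows. This is only an illustration under assumptions: the data here is synthetic, the shapes and the 60/20/20 ratio are made up, and X, y, X_real, y_real are assumed to be pandas objects as in the question. The set names follow this answer (tune on "test", do the final check on "validate"); many texts use the two names the other way round.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the asker's data (hypothetical shapes and values)
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(800, 4)), columns=list("abcd"))
y = pd.Series(rng.normal(size=800), name="target")
X_real = pd.DataFrame(rng.normal(loc=0.5, size=(200, 4)), columns=list("abcd"))
y_real = pd.Series(rng.normal(size=200), name="target")

# Step 1: pool everything so that all three sets come from the same mixture
X_all = pd.concat([X, X_real], ignore_index=True)
y_all = pd.concat([y, y_real], ignore_index=True)

# Step 2: split randomly into train / test / validate.
# train_test_split shuffles by default; random_state only makes it repeatable.
X_train, X_rest, y_train, y_rest = train_test_split(
    X_all, y_all, test_size=0.4, random_state=42
)
X_test, X_val, y_test, y_val = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42
)

print(len(X_train), len(X_test), len(X_val))
```

For steps 3 and 4, the model would be fitted with validation_data=(X_test.values, y_test.values) as in the question, and X_val, y_val would be left untouched until the very end.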