Overfitting and data leakage in a TensorFlow/Keras neural network

Good morning, I'm new to machine learning and neural networks. I am trying to build a fully connected neural network to solve a regression problem. The dataset consists of 18 features and 1 label, all of which are physical quantities.

You can find the code below, along with a figure showing how the loss function evolves over the epochs. I am not sure whether the model is overfitting. Can someone explain why it is or is not?

import pandas as pd
import numpy as np

from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn import preprocessing

from sklearn.model_selection import train_test_split

from matplotlib import pyplot as plt

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, BatchNormalization
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras import optimizers, regularizers
from tensorflow.keras.regularizers import l2
from sklearn.metrics import r2_score

# =============================================================================
# Choose the test size
# =============================================================================
test_size = 0.2

dataset = pd.read_csv('DataSet.csv', decimal=',', delimiter = ";")

label = dataset.iloc[:,-1]
features = dataset.drop(columns = ['Label'])

y_max_pre_normalize = max(label)
y_min_pre_normalize = min(label)

def denormalize(y):
    final_value = y*(y_max_pre_normalize-y_min_pre_normalize)+y_min_pre_normalize
    return final_value

# =============================================================================
# Split
# =============================================================================

X_train1, X_test1, y_train1, y_test1 = train_test_split(features, label, test_size = test_size, shuffle = True)

y_test2 = y_test1.to_frame()
y_train2 = y_train1.to_frame()

# =============================================================================
# Normalize
# =============================================================================
scaler1 = preprocessing.MinMaxScaler()
scaler2 = preprocessing.MinMaxScaler()
X_train = scaler1.fit_transform(X_train1)
X_test = scaler2.fit_transform(X_test1)

scaler3 = preprocessing.MinMaxScaler()
scaler4 = preprocessing.MinMaxScaler()
y_train = scaler3.fit_transform(y_train2)
y_test = scaler4.fit_transform(y_test2)

# =============================================================================
# Build the network
# =============================================================================
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
model = Sequential()

model.add(Dense(60, input_shape = (X_train.shape[1],), activation = 'relu',kernel_initializer='glorot_uniform'))
model.add(Dense(60, activation = 'relu',kernel_initializer='glorot_uniform'))
model.add(Dense(60, activation = 'relu',kernel_initializer='glorot_uniform'))

model.add(Dense(1,activation = 'linear',kernel_initializer='glorot_uniform'))

model.compile(loss = 'mse', optimizer = optimizer, metrics = ['mse'])

history = model.fit(X_train, y_train, epochs = 100,
                    validation_split = 0.1, shuffle=True, batch_size=250)

history_dict = history.history

loss_values = history_dict['loss']
val_loss_values = history_dict['val_loss']

y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

y_train_pred = denormalize(y_train_pred)
y_test_pred = denormalize(y_test_pred)

plt.plot((y_test1),(y_test_pred),'.', color='darkviolet', alpha=1, marker='o', markersize = 2, markeredgecolor = 'black', markeredgewidth = 0.1)
plt.plot((np.array((-0.1,7))),(np.array((-0.1,7))),'-', color='magenta')

plt.plot((y_train1),(y_train_pred),'.', color='darkviolet', alpha=1, marker='o', markersize = 2, markeredgecolor = 'black', markeredgewidth = 0.1)
plt.plot((np.array((-0.1,7))),(np.array((-0.1,7))),'-', color='magenta')

plt.plot(loss_values,'b',label = 'training loss')
plt.plot(val_loss_values,'r',label = 'val training loss')
plt.ylabel('Loss Function')

# r2_score expects (y_true, y_pred); the score is not symmetric in its arguments
print("\n\nThe R2 score on the test set is:\t{:0.3f}".format(r2_score(y_test1, y_test_pred)))

print("The R2 score on the train set is:\t{:0.3f}".format(r2_score(y_train1, y_train_pred)))
from sklearn import metrics

# Measure MSE and RMSE on both splits (arguments in the conventional y_true, y_pred order).
score = metrics.mean_squared_error(y_test1, y_test_pred)
print("\n\nFinal score test (MSE): %0.4f" %(score))
score1 = metrics.mean_squared_error(y_train1, y_train_pred)
print("Final score train (MSE): %0.4f" %(score1))
score2 = np.sqrt(metrics.mean_squared_error(y_test1, y_test_pred))
print("Final score test (RMSE): %0.4f" %(score2))
score3 = np.sqrt(metrics.mean_squared_error(y_train1, y_train_pred))
print("Final score train (RMSE): %0.4f" %(score3))

I also tried feature importance selection and increasing the number of epochs. These are the results:

Feature Importance:

No Feature Importance:

  • Looks like you don't have overfitting! Your training and validation curves are descending together and converging. The clearest sign of overfitting would be a divergence between these two curves, where the training loss keeps falling while the validation loss levels off or starts to rise, something like this overfitting example.

    Since your two curves are both descending and are not diverging, your NN training looks healthy.

    HOWEVER! Your validation curve is suspiciously below the training curve. This hints at possible data leakage (train and test data have been mixed somehow). There is more info in a nice and short blog post. In general, you should split the data before any other preprocessing (normalizing, augmentation, shuffling, etc.).

    Another possible cause is some type of regularization (dropout, BN, etc.) that is active while the training loss is computed and deactivated when the validation/test loss is computed.
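
To make the "split before any other preprocessing" advice concrete, here is a minimal sketch of one way to order the steps. It is not a drop-in fix for the code in the question: the random data stands in for `DataSet.csv`, and the names `x_scaler` and `y_scaler` are only illustrative. The idea is that each scaler is fitted on the training split alone and then reused, unchanged, on the test split and for mapping values back to physical units.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for DataSet.csv (assumption): 18 features, 1 label.
rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 18))
label = features @ rng.normal(size=18) + rng.normal(scale=0.1, size=1000)

# 1) Split first, so nothing computed afterwards can see the test rows.
X_train_raw, X_test_raw, y_train_raw, y_test_raw = train_test_split(
    features, label, test_size=0.2, shuffle=True, random_state=0
)

# 2) Fit the scalers on the training split only.
x_scaler = MinMaxScaler().fit(X_train_raw)
y_scaler = MinMaxScaler().fit(y_train_raw.reshape(-1, 1))

# 3) Apply the already fitted scalers to both splits
#    (transform on the test split, never fit_transform).
X_train = x_scaler.transform(X_train_raw)
X_test = x_scaler.transform(X_test_raw)
y_train = y_scaler.transform(y_train_raw.reshape(-1, 1))
y_test = y_scaler.transform(y_test_raw.reshape(-1, 1))

# 4) Map scaled values back to physical units with the same scaler
#    that produced them, instead of a separately computed min/max.
y_test_restored = y_scaler.inverse_transform(y_test).ravel()
```

With this ordering, scaled test values may fall slightly outside [0, 1], which is expected: the test split is being measured against the range seen in training. Model predictions can be mapped back to physical units the same way, with `y_scaler.inverse_transform`.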