I am a newbie to machine learning in Python and I have an ANN task at hand. I have developed a model using the Keras library, trying out different layer configurations in terms of activation function and number of neurons.
The task is to classify a dataset that has 11 features; the output column holds categorical data consisting of three classes (multilabel classification, I think).
The problem is that the model's performance exceeds my expectations, and it yields the following metrics:
F1 score: 0.918, accuracy: 0.924, precision: 0.897, recall: 0.924
I have tried different options by trial and error, but I don't know the correct way to implement this multilabel classification.
I have also tried implementing an overly complex model to see how overfitting happens, but strangely I got even better results.
I was wondering if anyone could help me here.
To answer your questions:
1. You should use a train/validation/test split.
The train split is, obviously, used for training - it's the data the loss is calculated on, and your model is optimized to get the best score on this dataset.
The validation split is used to check how your model performs on unseen data during training - it's the split you use to see whether your model is under- or overfitting. You fiddle with your model (tuning hyperparameters like learning rate, number of layers, etc.) looking mostly at the scores on the validation split, trying to build the model in a way that gets the best scores there.
Validation metrics are calculated at the end of each training epoch, just like the training metrics. This lets you see when the validation metrics hit a plateau or start getting worse (the model is overfitting). On the training data you should see improvement every epoch, because that's what your model is optimizing for (minimizing the loss on the training data).
In a typical plot of training and validation loss, the validation loss starts getting worse while the training loss is still improving. This is an example of overfitting. In most settings you will see overfitting quite fast for small-ish datasets or easy tasks.
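If you plot the loss curves yourself, this pattern is easy to spot. A minimal sketch with matplotlib, assuming `history` is the History object returned by a `model.fit()` call with validation data (as in the `validation_split` example further below):

```python
import matplotlib.pyplot as plt

# `history` is assumed to come from model.fit(..., validation_split=...).
plt.plot(history.history["loss"], label="training loss")
plt.plot(history.history["val_loss"], label="validation loss")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.show()
```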
You can easily set up training with validation in Keras and track both training and validation metrics. You can pass the `validation_split` argument to the `model.fit()` method to specify how much of your data should be used for validation (e.g. `validation_split=0.2` will use 20% of your data for validation).
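For example, a minimal sketch, assuming a compiled Keras model named `model` and training arrays `X_train`/`y_train` (the names and the epoch/batch settings are just illustrative):

```python
# Hold out 20% of the training data for validation; Keras then reports
# val_loss and val_accuracy at the end of every epoch.
history = model.fit(
    X_train, y_train,
    validation_split=0.2,
    epochs=50,
    batch_size=32,
)
```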
The test dataset is used for evaluating the model on a part of the data that you don't touch during model development. The model is not trained on it and you are not trying to tune your parameters for the highest score on it. It's supposed to represent the real-life scenario of getting new data once your model is deployed. You should only use it when you are done developing your model - after you have tuned your parameters to get the best scores on the validation dataset. Test loss is useful for comparing different models.
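One common way to carve out all three splits up front is scikit-learn's `train_test_split`, as in this sketch (assuming `X` and `y` hold your features and labels; the exact ratios are just an example):

```python
from sklearn.model_selection import train_test_split

# First set aside a test set that is only touched at the very end...
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# ...then split what is left into train and validation sets.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42, stratify=y_trainval
)
# Result: roughly 60% train, 20% validation, 20% test.
```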
2. It depends. To make any use of this data you should obviously encode it. However, it might turn out that this feature is not useful at all for your model's performance. There are techniques for checking feature importance. In your case you could check the correlation of those features with your target or with other features. It's a very popular topic in machine learning and you can easily find some great materials on it.
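As a quick first check, you could look at correlations with pandas. A sketch, assuming your data sits in a DataFrame `df` with a numerically encoded target column named `"target"` (both names are placeholders):

```python
import pandas as pd

# Pearson correlation of every numeric feature with the target.
# This only captures linear relationships, so treat it as a rough signal.
correlations = df.corr(numeric_only=True)["target"].sort_values(ascending=False)
print(correlations)
```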
3. By scaling, do you mean normalizing/standardizing? If so, you could obviously test it, but it shouldn't be much of an issue for a neural network with 0/1 features like gender.
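If you do want to standardize the continuous features, a sketch with scikit-learn's `StandardScaler` (the split names follow the earlier example; fit on the training split only so no information leaks from the validation/test data):

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# Fit the scaler on the training data only, then reuse the same
# transformation for the validation and test splits.
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)
```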
4. Dataset imbalance is a very serious issue for classification tasks, and it is very important to use proper metrics to score your models in such a setup. The classic example is cancer detection: if your dataset consists of pictures where 99% show healthy people and only 1% show cancer, your model could always predict that someone is healthy and still get 99% accuracy, even though it never detects cancer. You are already using precision and recall and your scores are not bad, so this might not be the case here. Sometimes your classifier will learn more easily if you provide it with more balanced data; you can learn how to train such a classifier in this tutorial from the TensorFlow team.
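Besides resampling, one simple option in Keras is passing class weights to `model.fit()`. A sketch, assuming integer-encoded labels in `y_train` (not one-hot vectors) and a compiled model named `model`:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Weight each class inversely proportionally to its frequency.
classes = np.unique(y_train)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_train)
class_weight = {int(c): float(w) for c, w in zip(classes, weights)}

model.fit(
    X_train, y_train,
    validation_split=0.2,
    epochs=50,
    class_weight=class_weight,
)
```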
It's a bit outside the scope of your question, but the truth is that neural networks aren't always the best tool. For tabular data many different methods are used, the most popular being XGBoost or LightGBM. You could even start with a simple Random Forest or Decision Tree from sklearn. You might want to try them on your problem. However, most of the information in my answer is still relevant for training them.
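For instance, a quick baseline with scikit-learn's `RandomForestClassifier`, reusing the splits from the earlier sketches (and again assuming integer-encoded labels):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

rf = RandomForestClassifier(n_estimators=300, random_state=42)
rf.fit(X_train, y_train)

# Per-class precision, recall and F1 on the held-out validation split.
print(classification_report(y_val, rf.predict(X_val)))
```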