tensorflow machine-learning keras neural-network lstm

Effect of number of nodes in LSTM

I am new to machine learning and I built a neural network with 2 dense layers. When I was experimenting, I had the following observations:

When I decreased the number of nodes in each dense layer, I seemed to get better training and prediction accuracy. This was surprising to me because I would assume the more nodes in a dense layer, the more the model can understand the data. Why does decreasing node number improve accuracy?
The model also yielded better results when the number of nodes in each dense layer was not consistent. For example, I got a better result when one dense layer had 5 nodes and the other had 10 than when both layers had 5 nodes or both had 10. Why is that? Is it common for inconsistent node counts in the dense layers to improve accuracy?

Solution

To answer your questions sequentially:

a) If decreasing the number of neurons in each dense layer gave you better training and prediction accuracy, you most likely reduced overfitting. Removing neurons lowers the capacity of the model, which acts as a regularizer and mitigates the overfitting effect. This is not an uncommon situation: depending on your dataset and the overall architecture of the neural network, decreasing the number of neurons in some layers may very well lead to better generalization.

b) Answer a) does not apply if only the training accuracy improved when you decreased the number of nodes, since overfitting increases training accuracy but reduces test/holdout accuracy.
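
To make a) and b) concrete, here is a minimal sketch of two checks. The helper names, the input size of 32 (a hypothetical LSTM output size), and the accuracy values are illustrative assumptions, not numbers from your model. The first function counts trainable parameters in a stack of dense layers as a rough measure of capacity; the second compares a larger and a smaller model on both training and validation accuracy.

```python
def dense_param_count(n_inputs, widths):
    """Trainable parameters in a stack of dense layers (output layer not included)."""
    total = 0
    for units in widths:
        total += n_inputs * units + units  # weight matrix plus one bias per unit
        n_inputs = units                   # this layer's output feeds the next layer
    return total


def compare_runs(larger, smaller):
    """Each argument is a (train_accuracy, validation_accuracy) pair."""
    gap_larger = larger[0] - larger[1]
    gap_smaller = smaller[0] - smaller[1]
    return {
        "validation_improved": smaller[1] > larger[1],
        "gap_shrank": gap_smaller < gap_larger,
    }


# Assumed input size of 32 features coming out of the LSTM.
print(dense_param_count(32, [10, 10]))  # 440
print(dense_param_count(32, [5, 10]))   # 225

# Made-up accuracies, only to show how the result is read.
print(compare_runs(larger=(0.99, 0.80), smaller=(0.92, 0.88)))
```

If `validation_improved` is false for your own numbers, the explanation in a) probably does not fit, which is the case described in b).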

The second question is case-dependent. When building neural networks from scratch, there is no guarantee that your problem will work better with approach A or approach B. This is why we do hyperparameter search and optimization: to find the parameters that minimize the loss on the validation set.
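
A hyperparameter search over the two layer widths can be sketched as a plain grid search. This is only one possible setup: `search_layer_widths` and `make_keras_score_fn` are hypothetical helper names, and the LSTM size of 32, the binary-classification output layer, and the use of validation accuracy as the score are assumptions about your problem. TensorFlow is imported lazily, so the search function itself has no dependencies.

```python
from itertools import product


def search_layer_widths(widths, score_fn):
    """Score every (first_layer, second_layer) width pair; higher score is better."""
    results = {}
    for units_1, units_2 in product(widths, repeat=2):
        results[(units_1, units_2)] = score_fn(units_1, units_2)
    best = max(results, key=results.get)
    return best, results


def make_keras_score_fn(x_train, y_train, x_val, y_val, epochs=20):
    """Return a score_fn that trains an LSTM + 2 Dense model (assumed architecture)."""
    import tensorflow as tf  # imported here so the grid search above stays dependency-free

    def score_fn(units_1, units_2):
        model = tf.keras.Sequential([
            tf.keras.Input(shape=x_train.shape[1:]),   # (timesteps, features)
            tf.keras.layers.LSTM(32),                   # LSTM size is an arbitrary assumption
            tf.keras.layers.Dense(units_1, activation="relu"),
            tf.keras.layers.Dense(units_2, activation="relu"),
            tf.keras.layers.Dense(1, activation="sigmoid"),  # assumes binary classification
        ])
        model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
        history = model.fit(x_train, y_train, validation_data=(x_val, y_val),
                            epochs=epochs, verbose=0)
        return history.history["val_accuracy"][-1]  # validation accuracy after the last epoch

    return score_fn
```

With your own arrays, `search_layer_widths([5, 10, 20], make_keras_score_fn(x_train, y_train, x_val, y_val))` would return the best width pair together with all scores. Because training is stochastic, small differences between pairs such as (5, 10) and (10, 10) may be noise, so it helps to repeat each configuration a few times before drawing conclusions.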

For common heuristics that apply when building a model from scratch, particularly with Dense layers, see the following link: https://towardsdatascience.com/17-rules-of-thumb-for-building-a-neural-network-93356f9930af. Some of these heuristics apply to Dense layers in general; it does not matter that the input, as in your problem, comes from an LSTM.