Criteria Behind Structuring a Neural Network

I'm just starting out with Torch and neural networks. Glancing through sample code and tutorials, I see a lot of variety in how people structure their networks. There are layers like Linear(), Tanh(), Sigmoid(), as well as criterions like MSE, ClassNLL, MultiMargin, etc.

I'm wondering what factors people keep in mind when designing the structure of their network. For example, I know that with ClassNLLCriterion you want the last layer of your network to be a LogSoftMax() layer, so that the criterion receives the log probabilities it expects.

Are there any other general rules or guidelines when it comes to creating these networks?

Thanks

Solution

Here is a good webpage covering the pros and cons of some of the main activation functions:

http://cs231n.github.io/neural-networks-1/#actfun

The choice often boils down to the problem at hand and to knowing what to do when something goes wrong. For example, if you have a huge dataset and can't churn through it quickly, a ReLU might be the better choice because it tends to train faster than saturating units such as sigmoid or tanh. However, you may find that some of the ReLU units "die", meaning they output zero for every input and stop being updated, so it is worth keeping track of the proportion of active neurons in that layer to make sure this hasn't happened.
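As a rough illustration of monitoring dead ReLU units, here is a minimal sketch in plain Python (not Torch). The function name dead_unit_fraction and the toy batch of pre-activation values are made up for the example; in practice you would collect the layer's values over a reasonably large sample of your data.

```python
def relu(x):
    """Rectified linear unit for a single value."""
    return max(0.0, x)


def dead_unit_fraction(pre_activations):
    """Fraction of units whose ReLU output is zero for every example.

    pre_activations: list of rows, one row per example, each row holding
    the pre-activation value of every unit in the layer.
    """
    n_units = len(pre_activations[0])
    dead = 0
    for j in range(n_units):
        # A unit counts as dead here if it never fires on any example.
        if all(relu(row[j]) == 0.0 for row in pre_activations):
            dead += 1
    return dead / n_units


# Hypothetical pre-activations: 3 examples, 4 units.
batch = [
    [0.5, -1.2, -0.3, 2.0],
    [1.1, -0.7, 0.4, -0.5],
    [-0.2, -3.0, 0.9, 0.1],
]
fraction = dead_unit_fraction(batch)
print(fraction)
```

Only the second unit is negative on every example, so one unit out of four is reported as dead. A unit that is inactive on a small batch is not necessarily dead for good, so treat the number as a warning sign rather than a verdict.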

Criterions are also problem-specific, but the choice is less ambiguous: for example, binary cross-entropy for binary classification, MSE for regression, and so on. It really depends on the objective of the whole project.
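To make the pairing between output layer and criterion concrete, here is a small sketch of the underlying formulas in plain Python. It is not the Torch API; the function names and example numbers are chosen for illustration only. It shows why a log-softmax output goes with a negative log-likelihood loss, alongside binary cross-entropy and mean squared error.

```python
import math


def log_softmax(scores):
    """Turn raw class scores into log probabilities (numerically stable)."""
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_sum for s in scores]


def nll_loss(log_probs, target):
    """Negative log-likelihood: expects log probabilities, not raw scores."""
    return -log_probs[target]


def binary_cross_entropy(p, y):
    """p is the predicted probability of class 1, y is the 0/1 label."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def mse(predictions, targets):
    """Mean squared error for regression outputs."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


# Multi-class classification: raw scores -> log-softmax -> NLL.
scores = [2.0, 0.5, -1.0]
log_probs = log_softmax(scores)
class_loss = nll_loss(log_probs, target=0)

# Binary classification: a probability output with binary cross-entropy.
bce_loss = binary_cross_entropy(0.9, 1)

# Regression: real-valued outputs with MSE.
mse_loss = mse([2.5, 0.0], [3.0, -0.5])

print(class_loss, bce_loss, mse_loss)
```

The point of the first case is that nll_loss only makes sense when its input already consists of log probabilities, which is the same reason a LogSoftMax() layer is placed before ClassNLLCriterion.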

For the overall network architecture, I personally find it is often a case of trying out different architectures and seeing which ones work and which don't on a held-out validation set (keep the test set for the final evaluation). If you think the problem at hand is very complex and needs a complex network to solve it, then you will probably want to start with a very deep network and add or remove a few layers at a time to see whether you have underfitted or overfitted. As another example, if you are using a convolutional network and the input is relatively small, then you might try a smaller set of convolutional filters to begin with.