Tags: machine-learning, neural-network, backpropagation, feed-forward

Feedforward Neural Network: Using a single network with multiple output neurons for many classes


I am currently working on the MNIST handwritten digits classification.

I built a single feedforward network with the following structure:

  • Inputs: 28x28 = 784 inputs
  • Hidden Layers: A single hidden layer with 1000 neurons
  • Output Layer: 10 neurons

All of the neurons use the sigmoid activation function.

The reported class is the one corresponding to the output neuron with the maximum output value.
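
Roughly, the forward pass looks like this (a simplified NumPy sketch; the weight initialization and variable names here are illustrative assumptions, not my actual code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Simplified sketch of the network: 784 -> 1000 -> 10, all sigmoid.
rng = np.random.default_rng(0)
W1 = rng.normal(0.0, 0.01, size=(784, 1000))  # input -> hidden weights
b1 = np.zeros(1000)
W2 = rng.normal(0.0, 0.01, size=(1000, 10))   # hidden -> output weights
b2 = np.zeros(10)

def predict(x):
    """x: flattened 28x28 image (shape (784,)); returns the predicted digit."""
    hidden = sigmoid(x @ W1 + b1)
    output = sigmoid(hidden @ W2 + b2)
    return int(np.argmax(output))  # class of the neuron with the max output
```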

My questions are:

  • Is it a good approach to create a single network with multiple outputs? I.e., should I instead create a separate network for each digit?

I ask because the network is currently stuck at a ~75% success rate. Since the "10 classifiers" effectively share the same hidden-layer neurons, I am not sure: does this reduce the network's learning capacity?

**EDIT:**

Since other people may refer to this thread, I want to be honest and update it: the ~75% success rate was after ~1500 epochs. I'm now at nearly 3000 epochs and the success rate is ~85%, so it works pretty well.


Solution

  • In short, yes it is a good approach to use a single network with multiple outputs. The first hidden layer describes decision boundaries (hyperplanes) in your feature space and multiple digits can benefit from some of the same hyperplanes. While you could create one ANN for each digit, that kind of one-vs-rest approach doesn't necessarily yield better results and requires training 10 times as many ANNs (each of which might be trained multiple times to try to avoid local minima). If you had hundreds or thousands of digits, then it might make more sense.
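
    To make the contrast concrete, here is a small sketch (plain NumPy; variable names are illustrative) of the targets the two approaches would train on:

    ```python
    import numpy as np

    labels = np.array([3, 1, 4])  # example digit labels for three images

    # Single multi-output network: one-hot targets, one ANN with 10 outputs.
    one_hot_targets = np.eye(10)[labels]  # shape (3, 10), one row per image

    # One-vs-rest: 10 separate binary target vectors, one ANN per digit.
    binary_targets = [(labels == d).astype(float) for d in range(10)]
    # e.g. binary_targets[4] == array([0., 0., 1.]) -> "is this image a 4?"
    ```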

    1000 neurons in a single hidden layer seems like a lot for this problem. I think you would probably achieve better results for handwritten digits by reducing that number and adding a second hidden layer. That would let you model more complex combinations of decision boundaries in the input feature space. For example, perhaps try something like a 784x20x20x10 network.
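
    As a hedged sketch of what that could look like (using Keras here purely for illustration; the loss and optimizer choices are assumptions, and you may well be using a from-scratch implementation):

    ```python
    from tensorflow import keras

    # Illustrative 784x20x20x10 network with sigmoid activations throughout.
    model = keras.Sequential([
        keras.Input(shape=(784,)),
        keras.layers.Dense(20, activation="sigmoid"),
        keras.layers.Dense(20, activation="sigmoid"),
        keras.layers.Dense(10, activation="sigmoid"),
    ])
    model.compile(optimizer="sgd", loss="mse", metrics=["accuracy"])
    model.summary()
    ```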

    If you do experiment with different network structures, it is usually better to start with a smaller number of layers and neurons and then increase complexity. That not only reduces training time but also helps avoid overfitting the data right away (you didn't mention whether your accuracy was measured on a training set or a validation set).
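
    On that last point, a minimal sketch of holding out a validation set (the split fraction and seed here are arbitrary):

    ```python
    import numpy as np

    def train_val_split(X, y, val_fraction=0.2, seed=0):
        """Shuffle the data and hold out a fraction for validation."""
        rng = np.random.default_rng(seed)
        idx = rng.permutation(len(X))
        n_val = int(len(X) * val_fraction)
        val_idx, train_idx = idx[:n_val], idx[n_val:]
        return X[train_idx], y[train_idx], X[val_idx], y[val_idx]

    # Report accuracy on the held-out set, not on the data used for training.
    ```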