neural-network, classification, multinomial

How can a well trained ANN have a single set of weights that can represent multiple classes?


In multinomial classification, I'm using the soft-max activation function for all non-linear units, and the ANN has k output nodes, one for each of the k classes. Each of the k output nodes in the output layer is connected to all the weights in the preceding layer, like in the network shown below.

[Image: a fully connected network with several output nodes sharing one hidden layer]

So, if the first output node tries to pull the weights in its favor, it will change all the weights that precede this layer, and the other output nodes will also pull, usually in directions that contradict the first one. It seems like a tug of war over a single set of weights. So, do we need a separate set of weights (including the weights of every node in every layer) for each of the output classes, or is there a different architecture for this? Please correct me if I'm wrong.


Solution

  • Each node has its own set of weights. Implementations and formulas usually use matrix multiplications, which can obscure the fact that, conceptually, each node has its own set of weights, but it does.

    Each node returns a single value that gets sent to every node in the next layer. So a node on layer h receives num(h - 1) inputs, where num(h - 1) is the number of nodes in layer h - 1. Let these inputs be x1, x2, ..., xk. Then the neuron returns:

    x1*w1 + x2*w2 + ... + xk*wk
    

    Or a function of this. So each neuron maintains its own set of weights.
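    A minimal sketch of that weighted sum, to make "each neuron maintains its own set of weights" concrete. The function name and values here are illustrative, not from any particular library:

```python
# One neuron's forward pass: the weighted sum x1*w1 + ... + xk*wk.
# Each neuron keeps its OWN weight list, even when an implementation
# stacks all of them into one matrix (one row per neuron).
def neuron_output(inputs, weights, bias=0.0):
    return sum(x * w for x, w in zip(inputs, weights)) + bias

# A neuron receiving 3 inputs from the previous layer:
print(neuron_output([1.0, 2.0, 3.0], [0.5, -1.0, 0.25]))
# 1*0.5 + 2*(-1.0) + 3*0.25 = -0.75
```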

    Let's consider the network in your image. Assume that we have some training instance for which the topmost neuron should output 1 and the others 0.

    So our target is:

    y = [1 0 0 0]
    

    And our actual output is (ignoring the softmax for simplicity):

    y^ = [0.88 0.12 0.04 0.5]
    

    So it's already doing pretty well, but we must still do backpropagation to make it even better.

    Now, our output delta is:

    y^ - y = [-0.12 0.12 0.04 0.5]
    

    You will update the weights of the topmost neuron using the delta -0.12, of the second neuron using 0.12 and so on.

    Notice that each output neuron's weights get updated using its own delta: each set of weights adjusts so that its neuron's output moves toward the correct value (0 or 1).
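    Here is a sketch of that per-neuron update, assuming a plain gradient step with a learning rate; the hidden activations and starting weights are hypothetical values chosen just for illustration:

```python
# Each output neuron updates ITS OWN weight row using ITS OWN delta:
#   w_ij := w_ij - lr * delta_i * hidden_out_j   (hypothetical gradient step)
y     = [1.0, 0.0, 0.0, 0.0]        # target
y_hat = [0.88, 0.12, 0.04, 0.5]     # actual output from the example above
deltas = [p - t for p, t in zip(y_hat, y)]   # ~[-0.12, 0.12, 0.04, 0.5]

hidden_out = [0.6, 0.3, 0.9]        # hypothetical hidden-layer activations
lr = 0.1                            # learning rate (assumed)

# One weight row per output neuron; all start equal just for the demo.
weights = [[0.1, 0.2, 0.3] for _ in range(4)]
for i, d in enumerate(deltas):
    weights[i] = [w - lr * d * h for w, h in zip(weights[i], hidden_out)]
```

    Note that the four rows end up different even though they started identical, because each row is driven only by its own neuron's delta.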

    Now, notice that each output neuron's output depends on the outputs of hidden neurons. So you must also update those. Those will get updated using each output neuron's delta (see page 7 here for the update formulas). This is like applying the chain rule when taking derivatives.
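    The chain rule for a hidden neuron can be sketched as follows: its delta is a weighted sum of all the output deltas, taken through the output-layer weights. The weight values are hypothetical, and a sigmoid hidden activation is assumed for the derivative term:

```python
# Hidden delta via the chain rule: every output neuron "pulls" on a
# hidden neuron j through its connecting weight (hypothetical values).
output_deltas = [-0.12, 0.12, 0.04, 0.5]
hidden_out    = [0.6, 0.3, 0.9]              # hidden activations (sigmoid assumed)
# out_weights[i][j] connects hidden neuron j to output neuron i
out_weights   = [[0.1, 0.2, 0.3],
                 [0.4, 0.5, 0.6],
                 [0.7, 0.8, 0.9],
                 [0.1, 0.1, 0.1]]

hidden_deltas = []
for j, h in enumerate(hidden_out):
    # sum of all output deltas, each weighted by its connection to j:
    pull = sum(d * out_weights[i][j] for i, d in enumerate(output_deltas))
    hidden_deltas.append(pull * h * (1.0 - h))   # times sigmoid derivative
```

    This is exactly the "tug of war": the `pull` term mixes every output neuron's error into one hidden neuron's update.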

    You're right that, for a given hidden neuron, there is a "tug of war" going on, with each output neuron's error pulling its own way. But this is normal, because the hidden layer must learn to satisfy all output neurons. This is one reason for initializing the weights randomly and for using multiple hidden neurons.

    It is the output layer that adapts to give the final answers, which it can do since the weights of the output nodes are independent of each other. The hidden layer has to be influenced by all output nodes, and it must learn to accommodate them all.