java arrays neural-network backpropagation

Java - normalize and denormalize nominal attributes in neural networks

Hi I am building a simple multilayer network which is trained using back propagation. My problem at the moment is that some attributes in my dataset are nominal (non numeric) and I have to normalize them. I wanted to know what the best approach is. I was thinking along the lines of counting up how many distinct values there are for each attribute and assigning each an equal number between 0 and 1. For example suppose one of my attributes had values A to E then would the following be suitable?:

A = 0
B = 0.25
C = 0.5
D = 0.75
E = 1

The second part to my question is denormalizing the output to get it back to a nominal value. Would I first do the same as above to each distinct output attribute value in the dataset in order to get a numerical representation? Also after I get an output from the network, do I just see which number it is closer to? For example if I got 0.435 as an output and my output attribute values were assigned like this:

x = 0
y = 0.5
z = 1

Do I just find the nearest value to the output (0.435) which is y (0.5)?

Solution

You can only do what you are proposing if the variables are ordinal and not nominal, and even then it is a somewhat arbitrary decision. Before I suggest a solution, a note on terminology:

Nominal vs ordinal variables

Suppose A, B, etc stand for colours. These are the values of a nominal variable and can not be ordered in a meaningful way. You can't say red is greater than yellow. Therefore, you should not be assigning numbers to nominal variables .

Now suppose A, B, C, etc stand for garment sizes, e.g. small, medium, large, etc. Even though we are not measuring these sizes on an absolute scale (i.e. we don't say that small corresponds to 40 a chest circumference), it is clear that small < medium < large. With that in mind, it is still somewhat arbitrary whether you set small=1, medium=2, large=3, or small=2, medium=4, large=8.

One-of-N encoding A better way to go about this is to to use the so called one-out-of-N encoding. If you have 5 distinct values, you need five input units, each of which can take the value 1 or 0. Continuing with my garments example, size extra small can be encoded as 10000, small as 01000, medium as 00100, etc.

A similar principle applies to the outputs of the network. If we treat garment size as output instead of input, when the network output the vector [0.01 -0.01 0.5 0.0001 -.0002], you interpret that as size medium.

In reply to your comment on @Daan's post: if you have 5 inputs, one of which takes 20 possible discrete values, you will need 24 input nodes. You might want to normalise the values of your 4 continuous inputs to the range [0, 1], because they may end out dominating your discrete variable.