Hi I am building a simple multilayer network which is trained using back propagation. My problem at the moment is that some attributes in my dataset are nominal (non numeric) and I have to normalize them. I wanted to know what the best approach is. I was thinking along the lines of counting up how many distinct values there are for each attribute and assigning each an equal number between 0 and 1. For example suppose one of my attributes had values A to E then would the following be suitable?:
A = 0
B = 0.25
C = 0.5
D = 0.75
E = 1
The second part to my question is denormalizing the output to get it back to a nominal value. Would I first do the same as above to each distinct output attribute value in the dataset in order to get a numerical representation? Also after I get an output from the network, do I just see which number it is closer to? For example if I got 0.435 as an output and my output attribute values were assigned like this:
x = 0
y = 0.5
z = 1
Do I just find the nearest value to the output (0.435) which is y (0.5)?
You can only do what you are proposing if the variables are ordinal and not nominal, and even then it is a somewhat arbitrary decision. Before I suggest a solution, a note on terminology:
Nominal vs ordinal variables
Suppose A
, B
, etc stand for colours. These are the values of a nominal variable and can not be ordered in a meaningful way. You can't say red is greater than yellow. Therefore, you should not be assigning numbers to nominal variables .
Now suppose A
, B
, C
, etc stand for garment sizes, e.g. small
, medium
, large
, etc. Even though we are not measuring these sizes on an absolute scale (i.e. we don't say that small
corresponds to 40 a chest circumference), it is clear that small
< medium
< large
. With that in mind, it is still somewhat arbitrary whether you set small=1
, medium=2
, large=3
, or small=2
, medium=4
, large=8
.
One-of-N encoding
A better way to go about this is to to use the so called one-out-of-N encoding. If you have 5 distinct values, you need five input units, each of which can take the value 1
or 0
. Continuing with my garments example, size extra small
can be encoded as 10000
, small
as 01000
, medium as 00100
, etc.
A similar principle applies to the outputs of the network. If we treat garment size as output instead of input, when the network output the vector [0.01 -0.01 0.5 0.0001 -.0002]
, you interpret that as size medium
.
In reply to your comment on @Daan's post: if you have 5 inputs, one of which takes 20 possible discrete values, you will need 24 input nodes. You might want to normalise the values of your 4 continuous inputs to the range [0, 1]
, because they may end out dominating your discrete variable.