I have the following exercise:
After this, I should compare the values with those obtained using the decision tree model for the same data.
for the given data:
I have to normalize the values, and I am able to do that. But the question is: does it make sense to change the values in the race field like this:
and then normalize the values (0 to 1), or should I just use the fields "age", "salary" and "academic level"?
The dependent variable will be a new column, dividing the salary into "high" and "low".
Which fields does it make sense to normalize and use in the neural network? Can I normalize all of them and use every field in the neural network?
This is an interesting question. When working with neural networks (from a modern perspective), it is generally best to use as much of your data as possible and to minimize the amount of manual preprocessing.
Option 1 is the worst: just work with numeric attributes (normalized).
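A minimal sketch of that normalization, assuming a pandas DataFrame with numeric columns named "age" and "salary" (the column names and values here are just placeholders):

```python
import pandas as pd

# Toy frame standing in for the exercise data (column names are assumptions).
df = pd.DataFrame({"age": [23, 35, 52, 41], "salary": [21000, 48000, 90000, 60000]})

# Min-max normalization: rescale each numeric column to the [0, 1] range.
for col in ["age", "salary"]:
    df[col] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())
```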
Option 2 takes this a step further: also use categorical attributes where order is obvious. I suppose this is what you intend to do with "academic level". In these cases you could try to translate these values to normalized numbers. Not ideal, but better than not using them.
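For an ordered attribute like "academic level", the mapping could look something like this (the level names and the even 0-to-1 spacing are just an assumption for illustration):

```python
import pandas as pd

# Hypothetical "academic level" values; the ordering is what carries the information.
levels = pd.Series(["high school", "bachelor", "master", "phd", "bachelor"])

# Map ordered categories to evenly spaced values in [0, 1].
level_order = {"high school": 0.0, "bachelor": 1 / 3, "master": 2 / 3, "phd": 1.0}
encoded = levels.map(level_order)
```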
Option 3: For categorical attributes where order makes little sense, you could create a boolean attribute for each option! This may seem scary because it rapidly increases dimensionality, but it is often a good approach. For example, if you have 4 job categories, you could work with 4 columns, one for each job option.
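A sketch of that idea with pandas, using a hypothetical "job" column (the category names are made up):

```python
import pandas as pd

# Hypothetical categorical column with a few job categories.
df = pd.DataFrame({"job": ["teacher", "engineer", "nurse", "clerk", "engineer"]})

# One boolean (0/1) column per category; order no longer matters.
one_hot = pd.get_dummies(df["job"], prefix="job")
print(one_hot.head())
```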
Option 4: By far the best approach (but also the hardest to implement) is to use embeddings. This is similar to an idea that revolutionized the use of deep learning in natural language processing.
The problem with language is similar to yours: how to turn input words into numbers. The first approach is to translate each word in the string into a vector whose length is the number of words in the vocabulary, with all entries 0 except the one at the index of the current word. This is called one-hot encoding. Imagine our vocabulary is "Russia, Apple, Lake, Pear". Then the word "Apple" would be encoded as [0, 1, 0, 0]. Good, but this erases a lot of information. For example, Apple is more similar to Pear than to Russia, but [0, 1, 0, 0] is as similar to [0, 0, 0, 1] as it is to [1, 0, 0, 0].
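The same toy vocabulary in code, just to make the encoding concrete:

```python
import numpy as np

vocab = ["Russia", "Apple", "Lake", "Pear"]

def one_hot(word):
    # Vector of zeros with a 1 at the word's index in the vocabulary.
    vec = np.zeros(len(vocab))
    vec[vocab.index(word)] = 1.0
    return vec

print(one_hot("Apple"))  # [0. 1. 0. 0.]
```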
But we can use dense vectors! (Like [0.12, 0.42, -0.01, 0.9].) For example, it is typical to encode any word in the dictionary as a 300-dimensional vector. The subtleties of semantic similarity and meaning will be encoded in the different dimensions of the vector.
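The vectors below are made-up toy values, only to show how similarity becomes measurable once the representation is dense:

```python
import numpy as np

# Toy dense vectors (made-up values, not real embeddings).
apple = np.array([0.12, 0.42, -0.01, 0.90])
pear = np.array([0.10, 0.40, 0.05, 0.85])
russia = np.array([-0.80, 0.10, 0.70, -0.20])

def cosine(a, b):
    # Cosine similarity: closer to 1 means more similar directions.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(apple, pear))    # high (similar words)
print(cosine(apple, russia))  # low (unrelated words)
```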
So... why not do the same with your problematic attributes? Ordering the race input as you propose makes no sense and may confuse the algorithm. Why is "Asian" the highest value? Why is "Black" between "Hispanic" and "White"? (Using a social construct like race highlights why this is problematic.)
Word embeddings in NLP are often pretrained and reused; in your case you would train these vectors as part of your model's parameters (look for "embedding layers"). It may not be trivial to implement from scratch, but it is good to at least be aware of the possibility. If you want to give this idea a try, I would suggest looking into tabular learning with fastai, which makes all of this very approachable even for people without much experience.
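If you want to try that route, a minimal sketch with fastai could look like the following. The column names, toy values, validation split and hyperparameters are all assumptions for illustration, not part of your exercise:

```python
from fastai.tabular.all import *
import pandas as pd

# Toy stand-in for the exercise data (column names and values are assumptions).
df = pd.DataFrame({
    "age": [23, 35, 52, 41, 29, 60, 33, 47],
    "race": ["Asian", "White", "Black", "Hispanic", "White", "Asian", "Black", "White"],
    "academic level": ["bachelor", "master", "phd", "high school",
                       "bachelor", "master", "bachelor", "phd"],
    "salary_level": ["low", "high", "high", "low", "low", "high", "low", "high"],
})

# Categorify indexes each categorical column so it can feed an embedding layer,
# Normalize rescales the continuous ones; the learner then creates one trainable
# embedding per categorical variable and learns it along with the rest of the network.
dls = TabularDataLoaders.from_df(
    df,
    procs=[Categorify, Normalize],
    cat_names=["race", "academic level"],
    cont_names=["age"],
    y_names="salary_level",
    valid_idx=[6, 7],
    bs=4,
)
learn = tabular_learner(dls, layers=[200, 100], metrics=accuracy)
learn.fit_one_cycle(3)
```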