machine-learning neural-network data-analysis

How to deal with Qualitative Data in machine learning algorithms


Suppose I'm trying to use a neural network to predict how long my run will take. I have a lot of data from past runs: how many miles I plan on running, the total change in elevation (hills), the temperature, and the weather (sunny, overcast, raining, or snowing).

I'm confused about what to do with the last piece of data. Everything else I can input normally after standardizing, but I can't do that for the weather. My initial thought was just to have 4 extra variables, one for each type of weather, and input a 1 or a 0 depending on what it is.

Is this a good approach to the situation? Are there other approaches I should try?


Solution

  • You have a categorical variable that has four levels.

    A very typical way of encoding such values is to use a separate 0/1 flag variable for each level. More commonly, "n-1" coding is used, where one fewer flag is needed (the fourth value is represented by all flags being 0).

    n-1 coding is used for techniques that require numeric inputs, including logistic regression and neural networks. For large values of "n", however, it is a bad choice. The problem is that it creates many sparse inputs, and because the flags are mutually exclusive, they are correlated with one another. More inputs also mean more degrees of freedom in the network, making the network harder to train.

    In your case, you only have four values for this particular input. Splitting it into three variables is probably reasonable.
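
As a rough illustration of the two encodings described above, here is a minimal sketch in plain Python. The names WEATHER_LEVELS, one_hot, and dummy_code are made up for this example, and the choice of "snowing" as the level encoded by all zeros is an arbitrary assumption; any level could serve as the baseline.

```python
# Hypothetical helper names; the level order is an assumption for illustration.
WEATHER_LEVELS = ["sunny", "overcast", "raining", "snowing"]


def one_hot(value, levels=WEATHER_LEVELS):
    """One flag per level: exactly one flag is 1, the rest are 0."""
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    return [1 if value == level else 0 for level in levels]


def dummy_code(value, levels=WEATHER_LEVELS):
    """n-1 coding: drop the last level's flag, so that level is all zeros."""
    return one_hot(value, levels)[:-1]


if __name__ == "__main__":
    # Show both encodings side by side for every weather type.
    for weather in WEATHER_LEVELS:
        print(weather, one_hot(weather), dummy_code(weather))
```

The three flags from dummy_code would then be fed to the network alongside the standardized miles, elevation, and temperature inputs.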