r classification random-forest logistic-regression dummy-variable

When to use dummy variables in classification problems?


I am working on a binary classification problem: predicting whether a customer will subscribe to a campaign (for the airline industry).

My data set is at the customer and campaign-name level, and there are 43 variables under consideration.

Some variables are deciles (1 to 10), and others, such as education level, are coded 0 to 5. For education level, we can't say that 4 is twice as educated as 2. How should I treat these variables?

Do I need to convert these variables to dummy variables (0 or 1)? I am running logistic regression, random forest, and XGBoost in R. How can I check variable importance if I convert them to dummy variables? (Factor analysis is throwing errors.)


Solution

  • In my opinion, you do need dummy variables. You could split education level into separate variables, one per level:

    educational level:1

    educational level:2

    educational level:3

    and so on. Then each of these variables becomes a dummy variable.

    For example,

    educational level:1 yes:1 no:0

    educational level:2 yes:1 no:0

    Then fit a logistic regression model to the data and validate it with a resampling method such as cross-validation. I'm not quite sure what you mean by "variable importance", though: do you mean whether a variable is statistically significant, or something else?
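
As a minimal sketch of the approach described above, the dummy coding and a cross-validated logistic regression could look like this in base R. The data frame `customers`, its columns `education`, `decile`, and `subscribed`, and all simulated values are assumptions made up for illustration; none of them come from the question.

```r
# Toy data: every name and value here is an illustrative assumption,
# not the asker's real data set.
set.seed(42)
n <- 200
customers <- data.frame(
  education  = sample(0:5, n, replace = TRUE),   # coded level, 0 to 5
  decile     = sample(1:10, n, replace = TRUE),  # ordered score, 1 to 10
  subscribed = rbinom(n, size = 1, prob = 0.3)   # binary target
)

# Declare education as a factor so R treats it as categorical.
customers$education <- factor(customers$education, levels = 0:5)

# model.matrix() expands a factor into 0/1 dummy columns. With the default
# treatment contrasts, the first level (0) is dropped as the reference.
dummies <- model.matrix(~ education, data = customers)
head(dummies)

# glm() does the same expansion internally, so the factor can go
# straight into the formula.
fit <- glm(subscribed ~ education + decile, data = customers,
           family = binomial)
summary(fit)

# Simple 5-fold cross-validation of classification accuracy.
k <- 5
fold <- sample(rep(1:k, length.out = n))  # random fold label per row
accuracy <- sapply(1:k, function(i) {
  train <- customers[fold != i, ]
  test  <- customers[fold == i, ]
  m <- glm(subscribed ~ education + decile, data = train,
           family = binomial)
  p <- predict(m, newdata = test, type = "response")
  mean((p > 0.5) == test$subscribed)      # share of correct predictions
})
mean(accuracy)
```

Keeping `decile` numeric here is also an assumption; it could be turned into a factor in the same way if a linear effect across deciles seems too restrictive.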