I am working on a binary classification problem: predicting whether a customer will subscribe to a campaign (in the airline industry).
My data set is at the customer and campaign-name level, and there are 43 variables under consideration.
Some variables are deciles (1 to 10), and others, like education level, are coded 0 to 5. For education level, we can't say that 4 is twice as educated as 2. How should I treat these variables?
Do I need to convert them to dummy variables (0 or 1)? I am running logistic regression, random forest, and XGBoost in R. How can I check variable importance if I convert them to dummy variables? (Factor analysis is throwing errors.)
You do need dummy variables, in my opinion. How about converting educational level into multiple variables like these:
educational level:1
educational level:2
educational level:3
and so on. Then each of these becomes its own dummy variable.
For example,
educational level:1
yes:1 no:0
educational level:2
yes:1 no:0
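The encoding above can be sketched in a few lines. This is only an illustration in plain Python with made-up education codes, not the asker's data; the `one_hot` helper and the `education_level_` column names are hypothetical. It also assumes one level is dropped as the reference category, which is a common choice for logistic regression with an intercept but is an addition to the scheme described above.

```python
# Hypothetical education codes (0 to 5) for seven customers; not real data.
education = [0, 2, 5, 2, 1, 3, 4]

def one_hot(values, drop_first=True):
    """Return (column_names, rows) of 0/1 indicators, one column per level."""
    levels = sorted(set(values))
    if drop_first:
        # Assumption: drop the lowest level as the reference category so the
        # indicator columns are not perfectly collinear with an intercept.
        levels = levels[1:]
    names = [f"education_level_{lvl}" for lvl in levels]
    rows = [[1 if v == lvl else 0 for lvl in levels] for v in values]
    return names, rows

names, rows = one_hot(education)
print(names)
for value, row in zip(education, rows):
    # A customer at the reference level (0) gets all zeros.
    print(value, row)
```

In R, declaring the column with `factor()` should have `glm()` build equivalent indicator columns for you, so you may not need to create them by hand.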
Then fit a logistic model to your data and validate it with a resampling method such as cross-validation. I'm not quite sure what you mean by "variable importance", though: do you mean whether a variable is statistically significant, or ...?
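For the resampling step, here is a minimal sketch of how k-fold cross-validation splits the rows, again in plain Python. The `k_fold_indices` helper, the choice of 5 folds, and the 10-row example are all assumptions made for illustration; the model fitting itself is left as a comment.

```python
import random

def k_fold_indices(n_rows, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs; each row is in the test set exactly once."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)          # shuffle once, reproducibly
    folds = [idx[i::k] for i in range(k)]     # k roughly equal folds
    for i, test_idx in enumerate(folds):
        train_idx = [j for f_no, fold in enumerate(folds) if f_no != i for j in fold]
        yield train_idx, test_idx

splits = list(k_fold_indices(n_rows=10, k=5))
for train_idx, test_idx in splits:
    # Fit the model on the train_idx rows and score it on the test_idx rows here.
    print("test rows:", sorted(test_idx))
```

Averaging the score over the folds gives a less optimistic estimate than scoring on the same rows the model was fitted to.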