I am trying to use logistic regression for a prediction task with scikit-learn.
Here are two example features from my task:
Feature 1 (man, woman, unknown) --- categorical variable
Feature 2 (number of clicks) --- continuous variable
I am not sure how to encode the features when I feed the data into logistic regression.
Should I use 1, 2 and 3 to represent the categories man, woman and unknown, or use (1, 0, 0), (0, 1, 0), (0, 0, 1) to represent them with scikit-learn's LogisticRegression? And what about the continuous variable?
You can leave Feature 2 as it is.
Feature 1 is a little trickier, because "unknown" is effectively missing data. When working with missing data, you can either drop the affected rows or impute values for the feature. I recommend reading "Imputing missing values before building an estimator" in the scikit-learn documentation. It shows an example of imputing data and checking whether the predictions improve. If you do impute, try adding a dummy variable that flags the rows with imputed values; I have applied this approach successfully in the past.
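
As a rough sketch of how this could look in code (the toy data, the column names `gender`, `clicks` and `gender_was_missing`, and the choice of most-frequent imputation are all my own assumptions, not something from your data): "unknown" is treated as missing, a dummy column flags the imputed rows, and the remaining categories are one-hot encoded rather than coded as 1, 2, 3, since integer codes would impose an order that the categories do not have.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data, made up purely for illustration.
df = pd.DataFrame({
    "gender": pd.Series(
        ["man", "woman", "unknown", "woman", "man", "unknown", "man", "woman"],
        dtype=object,
    ),
    "clicks": [3, 10, 4, 12, 2, 9, 1, 11],
})
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])

X = df.copy()
# Dummy variable flagging the rows whose gender will be imputed.
X["gender_was_missing"] = (X["gender"] == "unknown").astype(int)
# Treat "unknown" as a missing value.
X["gender"] = X["gender"].mask(X["gender"] == "unknown", np.nan)

# Categorical feature: impute the most frequent category, then one-hot encode.
gender_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("gender", gender_steps, ["gender"]),
    # Continuous feature and the dummy flag are passed through unchanged.
    ("rest", "passthrough", ["clicks", "gender_was_missing"]),
])

model = Pipeline([
    ("prep", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X, y)
predictions = model.predict(X)
```

Whether imputation actually helps is worth checking with cross-validation on your real data; a simpler alternative is to skip the imputation and one-hot encode "unknown" as a third category of its own.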