scikit-learn logistic-regression one-hot-encoding

Encoding method of Logistic Regression in Scikit-learn

I am trying to use logistic regression for a prediction task with scikit-learn.

Here are two example features from my task:

Feature 1 (man, woman, unknown) --- categorical variable

Feature 2 (number of clicks) --- continuous variable

I am not sure how to encode the features when I feed the data into logistic regression.

Should I use 1, 2 and 3 to represent the categorical values man, woman and unknown, or should I use (1, 0, 0), (0, 1, 0), (0, 0, 1) to represent them when I use scikit-learn's LogisticRegression? And what about the continuous variable?

Solution

You should leave Feature 2 as it is.

Feature 1 is a little trickier. When working with missing data, you can either drop the affected rows entirely or impute values for the feature. I recommend reading "Imputing missing values before building an estimator" in the scikit-learn documentation, which shows an example of imputing data and checking whether the prediction improves. If you impute, try adding a dummy variable that flags the rows with imputed values; I have applied this approach successfully in the past.
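
To make the impute-plus-dummy idea concrete, here is a minimal sketch. It assumes that "unknown" marks a missing gender, that the most frequent observed value is an acceptable imputation, and that the imputed category is then one-hot encoded with pandas. The column names (gender, clicks, gender_missing), the toy data and the labels are all made up for illustration; they are not from the question.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy data (hypothetical): "unknown" stands for a missing gender.
df = pd.DataFrame({
    "gender": ["man", "woman", "unknown", "woman",
               "man", "unknown", "woman", "woman"],
    "clicks": [3, 10, 7, 12, 1, 8, 15, 9],
})
y = np.array([0, 1, 1, 1, 0, 0, 1, 1])  # made-up labels

is_missing = df["gender"] == "unknown"

# Dummy variable: 1 for rows whose gender will be imputed.
gender_missing = is_missing.astype(int).rename("gender_missing")

# Impute with the most frequent observed value (one simple choice).
most_frequent = df.loc[~is_missing, "gender"].mode().iloc[0]
gender_imputed = df["gender"].where(~is_missing, most_frequent)

# One-hot encode the imputed category; drop_first removes the
# redundant column, since the model already has an intercept.
gender_onehot = pd.get_dummies(
    gender_imputed, prefix="gender", drop_first=True, dtype=int
)

# The continuous feature is passed through unchanged.
X = pd.concat([gender_onehot, gender_missing, df[["clicks"]]], axis=1)

model = LogisticRegression(max_iter=1000).fit(X, y)
predictions = model.predict(X)
```

Whether the extra gender_missing column helps is an empirical question, so it is worth comparing cross-validated scores with and without it, as the documentation example does for different imputation strategies.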