
Using Categorical Predictor Variables in scikit-learn


Basic question here:

I'm trying to implement a simple classification model for credit card default where I just call model.fit and model.predict on my input data. However, that input data contains both categorical features (demographic information such as age, marital status, and education level) and continuous features (such as credit balances).

data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 30000 entries, 1 to 30000
Data columns (total 24 columns):
LIMIT_BAL    30000 non-null float64
SEX          30000 non-null int64
EDUCATION    30000 non-null int64
MARRIAGE     30000 non-null int64
AGE          30000 non-null int64
PAY_1        30000 non-null int64
PAY_2        30000 non-null int64
PAY_3        30000 non-null int64
PAY_4        30000 non-null int64
PAY_5        30000 non-null int64
PAY_6        30000 non-null int64
BILL_AMT1    30000 non-null float64
BILL_AMT2    30000 non-null float64
BILL_AMT3    30000 non-null float64
BILL_AMT4    30000 non-null float64
BILL_AMT5    30000 non-null float64
BILL_AMT6    30000 non-null float64
PAY_AMT1     30000 non-null float64
PAY_AMT2     30000 non-null float64
PAY_AMT3     30000 non-null float64
PAY_AMT4     30000 non-null float64
PAY_AMT5     30000 non-null float64
PAY_AMT6     30000 non-null float64
default      30000 non-null int64
dtypes: float64(13), int64(11)
memory usage: 5.7 MB

From my understanding, scikit-learn requires all data to be numerical, either continuous or explicitly encoded as categorical. The numerical part is not a problem, since all of my data is already coded numerically (e.g. 0 for married, 1 for not married), but three of my variables (SEX, EDUCATION, and MARRIAGE) are nominal/ordinal and need to be treated as categorical variables rather than plain int64 ones.

How do I encode these three variables with scikit-learn's preprocessing module so I can properly feed them into a model like Logistic Regression?

Thanks in advance, and please forgive the formatting (feel free to edit or recommend how I can properly include Jupyter Notebook output into a Stack Overflow post).


Solution

  • Categorical features need extra attention during feature engineering, because features like age, dates, etc. are not straightforward to encode. There are many ways to encode them, based on analysis of the data, domain knowledge, and other techniques.

    There is a library, category_encoders, which provides many encoders for such features, including ones based on statistics of the target. You can find more here.

    Here is another good resource that shows how to use an encoding method with an example. A scikit-learn-only sketch is shown below.
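
To answer the scikit-learn part of the question directly, here is a minimal sketch (my own illustration, not code from the original answer): a ColumnTransformer one-hot encodes the nominal columns SEX, EDUCATION, and MARRIAGE with sklearn.preprocessing.OneHotEncoder, scales the continuous columns, and feeds the result into a LogisticRegression. The file name credit_default.csv, the test split, and the max_iter setting are placeholder assumptions; adapt them to your data.

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

# Hypothetical file name; replace with however you load your 30,000-row DataFrame.
data = pd.read_csv("credit_default.csv")

# Columns to treat as nominal categories vs. everything else (continuous/ordinal).
categorical_cols = ["SEX", "EDUCATION", "MARRIAGE"]
numeric_cols = [c for c in data.columns if c not in categorical_cols + ["default"]]

X = data[categorical_cols + numeric_cols]
y = data["default"]

# One-hot encode the categorical columns and standardize the numeric ones.
preprocess = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
        ("num", StandardScaler(), numeric_cols),
    ]
)

# Chain preprocessing and the classifier so fit/predict handle encoding automatically.
model = Pipeline(
    steps=[
        ("preprocess", preprocess),
        ("clf", LogisticRegression(max_iter=1000)),
    ]
)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))

If you prefer the category_encoders route mentioned above, its encoders (for example category_encoders.OneHotEncoder or category_encoders.TargetEncoder) follow the same fit/transform interface as scikit-learn transformers, so they should drop into the ColumnTransformer or Pipeline with little change.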