python pandas logistic-regression

Pandas dataset features in wrong order for logistic regression


My train and test datasets' features/variables are initially in the same order with matching names, but after I use the get_dummies() method to convert my categorical variables to binary variables for logistic regression, the column order no longer matches. The categorical variable causing the issue is the 'Dependents' feature, which is either '0', '1', '2', or '3'. The get_dummies() method creates 4 different features ('Dependents_0', 'Dependents_1', 'Dependents_2', and 'Dependents_3').

In the train dataset, for some reason, it orders them as follows: 'Dependents_3', 'Dependents_0', 'Dependents_1', 'Dependents_2'

The test dataset has them in the correct order. Because of this, I believe it's causing an issue when I try to run my model on the test dataset, as I get the warning:

/usr/local/lib/python3.7/dist-packages/sklearn/base.py:493: FutureWarning: The feature names should match those that were passed during fit. Starting version 1.2, an error will be raised.
Feature names must be in the same order as they were in fit.

  warnings.warn(message, FutureWarning)

Other information regarding the dataset after the get_dummies() method is called:

=> train_ds.dtypes
ApplicantIncome              int64
CoapplicantIncome          float64
LoanAmount                 float64
Loan_Amount_Term           float64
Credit_History             float64
Loan_Status                  int64
Gender_Female                uint8
Gender_Male                  uint8
Married_No                   uint8
Married_Yes                  uint8
Dependents_3                 uint8
Dependents_0                 uint8
Dependents_1                 uint8
Dependents_2                 uint8
Education_Graduate           uint8
Education_Not Graduate       uint8
Self_Employed_No             uint8
Self_Employed_Yes            uint8
Property_Area_Rural          uint8
Property_Area_Semiurban      uint8
Property_Area_Urban          uint8
dtype: object


=> test_ds.dtypes
ApplicantIncome              int64
CoapplicantIncome            int64
LoanAmount                 float64
Loan_Amount_Term           float64
Credit_History             float64
Gender_Female                uint8
Gender_Male                  uint8
Married_No                   uint8
Married_Yes                  uint8
Dependents_0                 uint8
Dependents_1                 uint8
Dependents_2                 uint8
Dependents_3                 uint8
Education_Graduate           uint8
Education_Not Graduate       uint8
Self_Employed_No             uint8
Self_Employed_Yes            uint8
Property_Area_Rural          uint8
Property_Area_Semiurban      uint8
Property_Area_Urban          uint8
dtype: object
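
The two listings above can also be compared programmatically. This is a minimal sketch using toy frames (hypothetical data that only mimics the 'Dependents' columns) to check whether the feature names match and whether only their order differs:

```python
import pandas as pd

# Toy stand-ins for the real frames (hypothetical values, illustrative only).
train_ds = pd.DataFrame(
    [[1, 0, 0, 1, 0]],
    columns=["Loan_Status", "Dependents_3", "Dependents_0",
             "Dependents_1", "Dependents_2"],
)
test_ds = pd.DataFrame(
    [[1, 0, 0, 0]],
    columns=["Dependents_0", "Dependents_1", "Dependents_2", "Dependents_3"],
)

# The target column exists only in the train frame, so exclude it first.
train_features = train_ds.columns.drop("Loan_Status")

same_names = set(train_features) == set(test_ds.columns)    # same set of names?
same_order = list(train_features) == list(test_ds.columns)  # same positions?
print(same_names, same_order)  # True False
```

If the names match but the order does not, reordering the test columns should be enough; if the names differ, a category is probably missing from one of the datasets.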

Solution

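  • You can use the columns attribute of the train dataframe to reorder the test dataframe's columns. Because train_ds also contains the target column 'Loan_Status', which test_ds does not have, drop it from the column list first, and assign the result back:

    feature_cols = train_ds.columns.drop('Loan_Status')
    test_ds = test_ds[feature_cols]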
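
  • If a category might be absent from the test data, so that get_dummies() never creates its column, plain column selection would raise a KeyError. One possible alternative is DataFrame.reindex, which reorders the columns and fills any missing ones. The sketch below uses toy data, and filling with 0 assumes that a missing dummy column means "category not present":

```python
import pandas as pd

# Toy frames (hypothetical values); the test frame lacks "Dependents_3".
train_ds = pd.DataFrame({
    "ApplicantIncome": [5000, 3000],
    "Loan_Status": [1, 0],
    "Dependents_3": [0, 1],
    "Dependents_0": [1, 0],
    "Dependents_1": [0, 0],
})
test_ds = pd.DataFrame({
    "ApplicantIncome": [4000],
    "Dependents_0": [1],
    "Dependents_1": [0],
})

# Feature columns in the order the model saw during fit (target excluded).
feature_cols = train_ds.columns.drop("Loan_Status")

# Reorder to match train; columns absent from test are created and filled with 0.
test_aligned = test_ds.reindex(columns=feature_cols, fill_value=0)
print(list(test_aligned.columns))
# ['ApplicantIncome', 'Dependents_3', 'Dependents_0', 'Dependents_1']
```

    Note that reindex also silently drops any test column that is not in feature_cols, so it is worth checking for unexpected extra columns beforehand.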