My train and test datasets' features/variables are initially in the same order with matching names, but after I use the .get_dummies() method to convert my categorical variables to binary variables for logistic regression, the column order no longer matches. The categorical variable causing the issue is the 'Dependents' feature, which is '0', '1', '2', or '3'. The get_dummies() method creates 4 different features ('Dependents_0', 'Dependents_1', 'Dependents_2', and 'Dependents_3').
In the train dataset, for some reason, they are ordered as follows: 'Dependents_3', 'Dependents_0', 'Dependents_1', 'Dependents_2'
The test dataset has them in the correct order. I believe this is what causes a problem when I run my model on the test dataset, as I get the warning:
/usr/local/lib/python3.7/dist-packages/sklearn/base.py:493: FutureWarning: The feature names should match those that were passed during fit. Starting version 1.2, an error will be raised.
Feature names must be in the same order as they were in fit.
warnings.warn(message, FutureWarning)
Other information regarding the dataset after the get_dummies() method is called:
=> train_ds.dtypes
ApplicantIncome int64
CoapplicantIncome float64
LoanAmount float64
Loan_Amount_Term float64
Credit_History float64
Loan_Status int64
Gender_Female uint8
Gender_Male uint8
Married_No uint8
Married_Yes uint8
Dependents_3 uint8
Dependents_0 uint8
Dependents_1 uint8
Dependents_2 uint8
Education_Graduate uint8
Education_Not Graduate uint8
Self_Employed_No uint8
Self_Employed_Yes uint8
Property_Area_Rural uint8
Property_Area_Semiurban uint8
Property_Area_Urban uint8
dtype: object
=> test_ds.dtypes
ApplicantIncome int64
CoapplicantIncome int64
LoanAmount float64
Loan_Amount_Term float64
Credit_History float64
Gender_Female uint8
Gender_Male uint8
Married_No uint8
Married_Yes uint8
Dependents_0 uint8
Dependents_1 uint8
Dependents_2 uint8
Dependents_3 uint8
Education_Graduate uint8
Education_Not Graduate uint8
Self_Employed_No uint8
Self_Employed_Yes uint8
Property_Area_Rural uint8
Property_Area_Semiurban uint8
Property_Area_Urban uint8
dtype: object
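You can use the columns attribute of the train dataframe to reorder the test dataframe columns. Since train_ds also contains the target column Loan_Status, which test_ds does not have, drop it from the column list first and assign the result back:
test_ds = test_ds[train_ds.columns.drop('Loan_Status')]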
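Selecting by the train columns only works if every one of those columns exists in the test dataframe. If a category might appear in one frame but not the other, reindex is a more forgiving option. Below is a minimal sketch, assuming toy data invented for illustration (the values and the reduced set of columns are not from the question) and assuming the target column is named Loan_Status as in the dtypes output above:

```python
import pandas as pd

# Toy stand-ins for the question's frames; the values are made up.
train_ds = pd.DataFrame({
    "ApplicantIncome": [5000, 3000, 4000, 6000],
    "Dependents": ["3", "0", "1", "2"],
    "Loan_Status": [1, 0, 1, 1],
})
test_ds = pd.DataFrame({
    "ApplicantIncome": [4500, 3500, 2500],
    "Dependents": ["0", "1", "2"],  # no '3' here, so no Dependents_3 dummy
})

train_ds = pd.get_dummies(train_ds)
test_ds = pd.get_dummies(test_ds)

# Feature columns in the order the model will see them during fit
# (everything except the target).
feature_cols = train_ds.columns.drop("Loan_Status")

# Put the test columns in the same order; any dummy column that is
# missing from test is created and filled with 0.
test_ds = test_ds.reindex(columns=feature_cols, fill_value=0)

print(list(test_ds.columns))
```

Fitting on train_ds[feature_cols] and predicting on the reindexed test_ds then gives the model identical feature names in identical order, which is what the warning asks for.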