python-3.x pandas scale

Assessing features to labelencode or get_dummies() on dataset in Python


I'm working on the heart attack analysis dataset on Kaggle in Python. I am a beginner and I'm trying to figure out whether it's still necessary to one-hot-encode or label-encode these features. I see so many people encoding the values for this project, but I'm confused because everything already looks scaled (apart from age, thalach, oldpeak and slope).

age: age in years

sex: (1 = male; 0 = female)

cp: ordinal values 1-4

thalach: maximum heart rate achieved

exang: (1 = yes; 0 = no)

oldpeak: depression induced by exercise

slope: the slope of the peak exercise

ca: values (0-3)

thal: ordinal values 0-3

target: 0= less chance, 1= more chance

Would you say it's still necessary to one-hot-encode, or should I just use a StandardScaler straight away?

I've seen many people encode the whole dataset for this project, but it makes no sense to me to do so. Could you confirm whether using only StandardScaler would be enough?


Solution

  • StandardScaler rescales each column to zero mean and unit variance, so all features end up on a comparable scale. That keeps the model's weights in a similar range and makes gradient descent less likely to overshoot, which usually helps the model converge faster.

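    Independently of scaling, to decide between ordinal values and one-hot encoding, ask whether the numeric distance between a column's values is meaningful, i.e. whether values that are close together represent similar categories. If it is, use ordinal values; if it is not, one-hot encoding (for example with get_dummies()) is the better fit. If you know the order of the categories, you can assign the ordinal values manually. Otherwise, you can use LabelEncoder, keeping in mind that it assigns integers in sorted order of the labels, which may not reflect any real ordering. It seems the heart attack data already comes with ordinal values assigned manually. For example, the highest chest pain level is encoded as cp = 4.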

    It also helps to study notebooks that score well. For reference, this one reports 95% accuracy: https://www.kaggle.com/code/abhinavgargacb/heart-attack-eda-predictor-95-accuracy-score
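To make the scaling and encoding choices above concrete, here is a minimal sketch using scikit-learn's ColumnTransformer. The rows are made up for illustration and are not taken from the Kaggle dataset. The column grouping is also an assumption: age, thalach and oldpeak are scaled, thal is one-hot encoded only to show what that looks like (the question describes it as ordinal), and sex, exang and cp are passed through unchanged.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows for illustration only; NOT real records from the dataset.
df = pd.DataFrame({
    "age":     [63, 37, 41, 56, 57, 44],
    "thalach": [150, 187, 172, 178, 163, 140],
    "oldpeak": [2.3, 3.5, 1.4, 0.8, 0.6, 0.0],
    "sex":     [1, 1, 0, 1, 0, 1],
    "exang":   [0, 0, 0, 0, 1, 1],
    "cp":      [4, 3, 2, 2, 1, 3],
    "thal":    [1, 2, 2, 3, 0, 2],
})

# Assumed grouping of the columns (adjust to your own reading of the data).
continuous = ["age", "thalach", "oldpeak"]  # scaled to zero mean, unit variance
nominal = ["thal"]                          # assumption: treated as unordered

preprocess = ColumnTransformer(
    transformers=[
        ("scale", StandardScaler(), continuous),
        ("onehot", OneHotEncoder(handle_unknown="ignore"), nominal),
    ],
    remainder="passthrough",  # sex, exang, cp are kept as they are
    sparse_threshold=0,       # always return a dense NumPy array
)

X = preprocess.fit_transform(df)

# Output column order: scaled columns, one-hot columns, then passthrough columns.
print(X.shape)
print(X[:, :3].mean(axis=0))  # means of the scaled columns, close to 0
```

If you decide that a column such as cp or thal really is ordered, leave it out of `nominal` so it falls through to `remainder="passthrough"` and keeps its integer values. Running a model with both variants and comparing the validation scores is a reasonable way to check which treatment works better for your data.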