pandas, dataframe, machine-learning, feature-selection

Feature preprocessing - Large categorical multi-value feature


I'm barely scratching the surface of machine learning. I have gotten my hands on some real data from a hospital, and I want to predict a score between 1 and 6 based on various data in the medical chart. From the research I found, several others suggest an SVM for this task, so that is what I'm going for as well.

One of the features is Diagnosis. This feature contains a delimited list of diagnosis codes. Each patient can have between 1 and approx. 20 diagnosis codes, and some of them should in theory have a strong impact on the score the patient gets. Let's just say the format is DE280,BA234,DG4234 etc. With 30,000 patients, this leads to an immense set of features (8759 to be exact) if I naively OneHotEncode it. So what would my best option be to fit a Linear SVC model without getting hit by the "curse of dimensionality"?


Solution

  • You have many options; there are many algorithms for reducing the number of features.
    If you are working with Python, many of these algorithms are implemented in the scikit-learn package. For example, SelectPercentile selects the best features based on univariate statistical tests.
    Another way is to use SelectFromModel: fit your Linear SVC model and then find out which features have the most effect on the model.
    Besides these, you can test features in a cross-validation-like manner. In each fold, use a small number of features and find the best ones using the above methods. Then combine these features and
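
The first two options can be sketched roughly as follows. This is a minimal illustration, not a tuned solution: the toy records, the chi2 scoring function, the percentile value, and the L1-penalised LinearSVC settings are all assumptions made for the example. The delimited Diagnosis strings are first turned into a sparse multi-hot matrix with MultiLabelBinarizer, which is one possible encoding among several.

```python
from sklearn.feature_selection import SelectFromModel, SelectPercentile, chi2
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Hypothetical toy records: (delimited diagnosis codes, score).
records = [
    ("DE280,BA234", 5),
    ("DE280,DG4234", 5),
    ("DE280,ZZ100", 5),
    ("BA234,ZZ100", 2),
    ("DG4234,ZZ200", 2),
    ("BA234,ZZ200", 2),
    ("DE280,BA234,DG4234", 5),
    ("ZZ100,ZZ200", 2),
]
code_lists = [codes.split(",") for codes, _ in records]
y = [score for _, score in records]

# Multi-hot encode: one column per distinct code, stored as a sparse matrix
# so that thousands of codes stay cheap in memory.
mlb = MultiLabelBinarizer(sparse_output=True)
X = mlb.fit_transform(code_lists)

# Option 1: univariate selection. chi2 is an assumed scoring function;
# it accepts the non-negative sparse indicators produced above.
univariate = SelectPercentile(chi2, percentile=40)
X_chi2 = univariate.fit_transform(X, y)
chi2_codes = list(mlb.classes_[univariate.get_support()])

# Option 2: model-based selection. An L1 penalty pushes the weights of
# uninformative codes to zero; SelectFromModel keeps the remaining ones.
l1_svc = LinearSVC(penalty="l1", dual=False, C=1.0, max_iter=5000)
model_based = SelectFromModel(l1_svc)
X_l1 = model_based.fit_transform(X, y)

# Putting the selector inside a Pipeline means it is refit on each
# training fold when the pipeline is cross-validated.
clf = Pipeline([
    ("select", SelectPercentile(chi2, percentile=40)),
    ("svc", LinearSVC(max_iter=5000)),
])
clf.fit(X, y)
predictions = clf.predict(X)
```

On the real data, the percentile and C would normally be chosen by cross-validation rather than fixed by hand as they are here.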