machine-learning, logistic-regression, data-science

Machine learning: do unbalanced non-numeric variable classes matter?


If I have a non-numeric variable in my data set that contains many rows of one class but few of another, does this cause the same issues as when the target classes are unbalanced?

For example, suppose one of my variables is title and the aim is to identify whether a person is obese. The obese target class is split 50:50, but there is only one row with the title 'Duke', and this row is in the obese class. Does this mean that an algorithm like logistic regression (after numeric encoding) would start predicting that all Dukes are obese (or give a disproportionate weighting to the title 'Duke')? If so, are some algorithms better or worse at handling this case? Is there a way to prevent this issue?


Solution

  • Yes. Any vanilla machine learning algorithm treats categorical data the same way as numerical data in terms of the information entropy a specific feature carries.

    Consider this: before applying any machine learning algorithm, you should analyze your input features and identify how much of the variance in the target each one explains. In your case, if the label Duke is always identified as obese, then within that specific dataset it is an extremely high-information feature and should be weighted as such, even though that evidence rests on a single row.

    I would mitigate this issue by applying a weight to that feature that reduces the impact it has on the target. However, this would be a shame if the feature is otherwise very informative for other instances.

    An algorithm that can easily circumvent this problem is a random forest (an ensemble of decision trees): you can eliminate any rule that is based on this feature being Duke.

    Be very careful when mapping this feature to numbers, as the encoding you choose will affect the importance attributed to the feature by most algorithms.
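
As a rough illustration of the points above, here is a minimal sketch in plain Python. Everything in it is an assumption made for the example: the toy data, the one-hot mapping of title, the hand-written `fit_logistic` helper (no intercept, so each coefficient acts as that title's log-odds), and the reading of "adding a weight" as a uniform L2 penalty on the coefficients. It is one possible way to dampen the effect of the single Duke row, not the only one.

```python
import math

# Hypothetical toy data: (title, obese). The target is split 20:20,
# and 'Duke' appears in exactly one row, which is in the obese class.
rows = ([("Mr", 1)] * 10 + [("Mr", 0)] * 10 +
        [("Ms", 1)] * 9 + [("Ms", 0)] * 10 + [("Duke", 1)])

# One-hot encode the title so no artificial ordering is imposed on it.
titles = sorted({t for t, _ in rows})          # ['Duke', 'Mr', 'Ms']
X = [[1.0 if t == name else 0.0 for name in titles] for t, _ in rows]
y = [label for _, label in rows]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, l2=0.0, lr=0.5, steps=2000):
    """Gradient descent on mean log-loss + (l2 / 2) * sum(w_j ** 2)."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(steps):
        grad = [l2 * wj for wj in w]           # gradient of the penalty term
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j in range(d):
                grad[j] += err * xi[j] / n     # gradient of the log-loss
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

duke = titles.index("Duke")
w_plain = fit_logistic(X, y, l2=0.0)   # no penalty
w_l2 = fit_logistic(X, y, l2=0.1)      # assumed penalty strength

print("P(obese | Duke), no penalty:", sigmoid(w_plain[duke]))
print("P(obese | Duke), L2 penalty:", sigmoid(w_l2[duke]))
```

Without the penalty, the Duke coefficient keeps growing the longer the fit runs, because a category whose only row is obese has no finite best-fit value. With the penalty, the coefficient settles at a small value and the predicted probability for a Duke stays much closer to the overall 50:50 rate.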