Tags: xgboost, azure-machine-learning-service

Strange algorithm selection when using Azure AutoML with XGBoostClassifier on categorical data


I have a data model consisting only of categorical features and a categorical label.

So when I build that model manually in XGBoost, I would basically transform the features to binary columns (using LabelEncoder and OneHotEncoder), and the label into classes using LabelEncoder. I would then run a multiclass classification (multi:softmax). I tried that with my dataset and ended up with an accuracy around 0.4 (unfortunately I can't share the dataset due to confidentiality).
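For reference, a minimal sketch of that manual approach. The file name data.csv and the label column target are placeholders, not the real (confidential) data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from xgboost import XGBClassifier

df = pd.read_csv("data.csv")  # hypothetical path to the dataset

# One-hot encode the categorical features into binary columns
features = OneHotEncoder(handle_unknown="ignore").fit_transform(
    df.drop(columns=["target"])
)
# Encode the categorical label into integer classes
labels = LabelEncoder().fit_transform(df["target"])

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, random_state=42
)

# multi:softmax = multiclass (not multilabel) classification
model = XGBClassifier(objective="multi:softmax")
model.fit(X_train, y_train)
print("accuracy:", model.score(X_test, y_test))
```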

Now, if I run the same dataset through Azure AutoML, I end up with an accuracy around 0.85 in the best experiment. But what is really interesting is that AutoML uses SparseNormalizer and XGBoostClassifier with reg:logistic as the objective. So if I interpret this right, Azure AutoML just normalizes the data (somehow, from categorical data?) and then executes a logistic regression? Is this even possible / does this make sense with categorical data?

Thanks in advance.


Solution

  • TL;DR You're right that normalization doesn't make sense for training gradient-boosted decision trees (GBDTs) on categorical data, but it won't have an adverse impact. Also, reg:logistic is just the loss function the XGBoost trees optimize; the model is still a GBDT, not a linear logistic regression. AutoML is an automated framework for modeling: in exchange for control over calibration, you get ease of use. It is still worth verifying first that AutoML is receiving data with the columns properly encoded as categorical.

    Think of an AutoML model as effectively a sklearn Pipeline: a bundled set of pre-processing steps plus a predictive Estimator (a hand-built analog is sketched below the quote). AutoML samples from a large swath of pre-configured Pipelines in search of the most accurate one. As the docs say:

    In every automated machine learning experiment, your data is automatically scaled or normalized to help algorithms perform well. During model training, one of the following scaling or normalization techniques will be applied to each model.
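    Concretely, the winning run is roughly analogous to this hand-built Pipeline. Normalizer here is a stand-in for AutoML's SparseNormalizer (AutoML's actual wrapper classes are internal), and the toy data is purely illustrative:

    ```python
    import numpy as np
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import Normalizer
    from xgboost import XGBClassifier

    pipeline = Pipeline([
        ("normalizer", Normalizer()),     # scaling/normalization step
        ("classifier", XGBClassifier()),  # GBDT with a logistic loss, not a linear model
    ])

    # Illustrative random data; a real run would use the featurized dataset
    X = np.random.rand(100, 5)
    y = np.random.randint(0, 2, size=100)
    pipeline.fit(X, y)
    ```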

    To see this, you can call .named_steps on your fitted model. Also check out fitted_model.get_featurization_summary(). For example:
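    A minimal sketch, assuming automl_run is a completed AutoMLRun submitted via the Azure ML SDK:

    ```python
    # Retrieve the best child run and its fitted model
    best_run, fitted_model = automl_run.get_output()

    # The fitted model behaves like a sklearn Pipeline: list its steps
    for name, step in fitted_model.named_steps.items():
        print(name, type(step).__name__)

    # Per-column summary of how the raw data was featurized
    print(fitted_model.get_featurization_summary())
    ```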

    I especially empathize with your concern w.r.t. how LightGBM (MSFT's GBDT implementation) is leveraged by AutoML. LightGBM accepts categorical columns directly and, instead of one-hot encoding them, partitions their categories into two subsets at each split. Despite this, AutoML pre-processes away the categorical columns by one-hot encoding, scaling, and/or normalization, so this unique categorical handling is never utilized in AutoML. A sketch of what that native handling looks like outside AutoML follows.
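    For contrast, a minimal sketch of using LightGBM's native categorical handling directly, again assuming a hypothetical data.csv with a label column named target:

    ```python
    import lightgbm as lgb
    import pandas as pd

    df = pd.read_csv("data.csv")  # hypothetical path

    # Mark features as pandas 'category' dtype so LightGBM treats them natively
    X = df.drop(columns=["target"]).astype("category")
    y = df["target"].astype("category").cat.codes

    # With 'category' dtypes, LightGBM splits by partitioning categories into
    # two subsets -- no one-hot encoding step required
    model = lgb.LGBMClassifier()
    model.fit(X, y, categorical_feature="auto")
    ```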

    If you're interested in "manual" ML in Azure ML, I highly suggest looking into Estimators and Azure ML Pipelines.