python, machine-learning, xgboost

Getting a categorical-related error when fitting an XGBoost model that has no categorical columns


I have a DataFrame whose columns have the following dtypes:

{Int64Dtype(), UInt8Dtype(), dtype('float64'), dtype('int64')}

When I try to fit xgb.XGBClassifier(), I get the following error:

ValueError: DataFrame.dtypes for data must be int, float, bool or category.  When
categorical type is supplied, DMatrix parameter `enable_categorical` must
be set to `True`. Invalid columns: NAME OF COLS THAT ARE UINT TYPE

Solution

  • Here's the code that raises the error:

    def _invalid_dataframe_dtype(data: DataType) -> None:
        # pandas series has `dtypes` but it's just a single object
        # cudf series doesn't have `dtypes`.
        if hasattr(data, "dtypes") and hasattr(data.dtypes, "__iter__"):
            bad_fields = [
                str(data.columns[i])
                for i, dtype in enumerate(data.dtypes)
                if dtype.name not in _pandas_dtype_mapper
            ]
            err = " Invalid columns:" + ", ".join(bad_fields)
        else:
            err = ""
    
        type_err = "DataFrame.dtypes for data must be int, float, bool or category."
        msg = f"""{type_err} {_ENABLE_CAT_ERR} {err}"""
        raise ValueError(msg)
    

    (Source.)

    It references another variable, _pandas_dtype_mapper, a dict keyed by dtype name. Any column whose dtype name is not a key in that dict is reported as invalid. Here's how it is defined:

    _pandas_dtype_mapper = {
        'int8': 'int',
        'int16': 'int',
        'int32': 'int',
        'int64': 'int',
        'uint8': 'int',
        'uint16': 'int',
        'uint32': 'int',
        'uint64': 'int',
        'float16': 'float',
        'float32': 'float',
        'float64': 'float',
        'bool': 'i',
        # nullable types
        "Int16": "int",
        "Int32": "int",
        "Int64": "int",
        "boolean": "i",
    }
    

    (Source.)

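    So, here we find the problem. The mapper has entries for the plain uint dtypes (uint8 through uint64) and for some nullable signed dtypes (Int16, Int32, Int64), but in the version quoted here it has no entry for a nullable unsigned dtype such as UInt8. Those columns fail the lookup, and the error message mentions categoricals even though none are involved.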
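To see which columns would fail that lookup, the same membership test can be run by hand. This is a minimal sketch: the set below copies the keys of the mapper quoted above, so it may not match other xgboost versions, and `find_unsupported` is a hypothetical helper name, not part of xgboost.

```python
from typing import Dict, List

# Keys copied from the _pandas_dtype_mapper quoted above (assumption: other
# xgboost versions may list different dtype names).
SUPPORTED_DTYPE_NAMES = {
    "int8", "int16", "int32", "int64",
    "uint8", "uint16", "uint32", "uint64",
    "float16", "float32", "float64",
    "bool",
    "Int16", "Int32", "Int64", "boolean",
}


def find_unsupported(dtype_names: Dict[str, str]) -> List[str]:
    """Return the columns whose dtype name is missing from the set above.

    `dtype_names` maps column name -> dtype name (hypothetical helper).
    """
    return [
        col for col, name in dtype_names.items()
        if name not in SUPPORTED_DTYPE_NAMES
    ]


# Dtype names as pandas would report them for a frame like the one in the question.
example = {"age": "UInt8", "score": "float64", "count": "Int64"}
print(find_unsupported(example))
```

With a real DataFrame `df`, the input could be built as `{col: dtype.name for col, dtype in df.dtypes.items()}`.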

    This suggests two possible workarounds:

    1. Use a signed integer dtype instead of an unsigned one, for example Int16 in place of UInt8.
    2. Fill in the null values in that column, then convert it to a non-nullable dtype such as uint8.
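Both workarounds can be sketched with pandas as follows. The DataFrame, its column names, and the fill value 0 are made up for illustration; choose a fill value that makes sense for your data.

```python
import pandas as pd

# Hypothetical frame with one nullable unsigned column, as in the question.
df = pd.DataFrame({
    "a": pd.array([1, 2, None], dtype="UInt8"),
    "b": [0.5, 1.5, 2.5],
})

# Workaround 1: cast to a nullable signed dtype that the mapper lists.
# Int16 can hold every UInt8 value (0..255) and keeps the missing value as <NA>.
signed = df.astype({"a": "Int16"})

# Workaround 2: fill the nulls, then cast to a plain NumPy uint8.
# The fill value 0 is an arbitrary placeholder (assumption).
filled = df.assign(a=df["a"].fillna(0).astype("uint8"))

print(signed.dtypes["a"].name)
print(filled.dtypes["a"].name)
```

The first option keeps the information about which rows were missing; the second discards it, so it only fits when a fill value is meaningful for the column.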