How to export a CatBoost model to text for later parsing into an if-else decision tree?


I'm currently using the new CatBoost algorithm (Python version) and trying to export my model to a text file so that I can port it to a C/Java implementation. Looking through the documentation, I have only found the save_model method, which accepts only two file formats: 1. binary 2. CoreML (for Apple)

Neither of these formats is suitable for me, so is there another way to achieve this?


Solution

  • There is no way to do this directly: CatBoost doesn't support serializing a model to a human-readable text format so far.

    However, CatBoost can already export models to CoreML, and coremltools can dump a CoreML model as JSON-like text. Here is a minimal example:

    from sklearn import datasets
    iris = datasets.load_iris()
    
    import catboost
    # the shortest possible model specification
    cls = catboost.CatBoostClassifier(loss_function='MultiClass', iterations=1, depth=1)
    cls.fit(iris.data, iris.target)
    
    # save model to CoreML format
    cls.save_model(
        "iris.mlmodel",
        format="coreml", 
        export_parameters={
            'prediction_type': 'probability'
        }
    )
    
    # there is a CoreML tool for model serialization
    import coremltools
    model = coremltools.models.model.MLModel("iris.mlmodel")
    # print explicitly: a bare expression shows nothing when run as a script
    print(model.get_spec())
    

    You will probably need to read the coremltools documentation to fully understand what this code prints, but the output can be read like this: "There is an ensemble of a single tree with 2 leaves. In leaf 0, class 0 dominates; in leaf 1, classes 1 and 2 dominate. The branch node (node 2) sends a sample to leaf 1 if feature 3 is greater than 0.8, otherwise to leaf 0."

    specificationVersion: 1
    description {
      input {
        name: "feature_3"
        type {
          doubleType {
          }
        }
      }
      output {
        name: "prediction"
        type {
          multiArrayType {
            shape: 3
            dataType: DOUBLE
          }
        }
      }
      predictedFeatureName: "prediction"
      predictedProbabilitiesName: "prediction"
      metadata {
        shortDescription: "Catboost model"
        versionString: "1.0.0"
        author: "Mr. Catboost Dumper"
      }
    }
    treeEnsembleRegressor {
      treeEnsemble {
        nodes {
          nodeBehavior: LeafNode
          evaluationInfo {
            evaluationValue: 0.05084745649058943
          }
          evaluationInfo {
            evaluationIndex: 1
            evaluationValue: -0.025423728245294732
          }
          evaluationInfo {
            evaluationIndex: 2
            evaluationValue: -0.025423728245294732
          }
        }
        nodes {
          nodeId: 1
          nodeBehavior: LeafNode
          evaluationInfo {
            evaluationValue: -0.02752293516463098
          }
          evaluationInfo {
            evaluationIndex: 1
            evaluationValue: 0.01376146758231549
          }
          evaluationInfo {
            evaluationIndex: 2
            evaluationValue: 0.013761467582315471
          }
        }
        nodes {
          nodeId: 2
          nodeBehavior: BranchOnValueGreaterThan
          branchFeatureIndex: 3
          branchFeatureValue: 0.800000011920929
          trueChildNodeId: 1
        }
        numPredictionDimensions: 3
        basePredictionValue: 0.0
        basePredictionValue: 0.0
        basePredictionValue: 0.0
      }
      postEvaluationTransform: Classification_SoftMax
    }
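
    To get from this dump to the if-else form the question asks for, the tree can be walked recursively. The following is a minimal sketch, not part of CatBoost or coremltools: the NODES dictionary is transcribed by hand from the dump above into a made-up layout, and find_root, emit, evaluate and softmax are hypothetical helper names. It assumes a single tree, zero base prediction values (as in the dump), and that an omitted falseChildNodeId means node 0, since the text dump leaves out default-valued fields. With a real spec the same fields should be readable from spec.treeEnsembleRegressor.treeEnsemble.nodes, but that part is not shown here.

```python
import math

# Nodes transcribed by hand from the dump above (illustrative layout only).
NODES = {
    0: {"leaf": [0.05084745649058943, -0.025423728245294732, -0.025423728245294732]},
    1: {"leaf": [-0.02752293516463098, 0.01376146758231549, 0.013761467582315471]},
    # falseChildNodeId is absent from the dump; assumed to be the default, 0.
    2: {"feature": 3, "threshold": 0.800000011920929, "true": 1, "false": 0},
}


def find_root(nodes):
    """Return the id of the only node that no branch node points to."""
    children = {n[k] for n in nodes.values() if "leaf" not in n
                for k in ("true", "false")}
    (root,) = set(nodes) - children
    return root


def emit(node_id, nodes, depth=1):
    """Return C/Java-style source lines for the subtree rooted at node_id."""
    pad = "    " * depth
    node = nodes[node_id]
    if "leaf" in node:
        # A leaf adds one value per class to the output scores.
        return [f"{pad}out[{k}] += {v!r};" for k, v in enumerate(node["leaf"])]
    return (
        [f"{pad}if (x[{node['feature']}] > {node['threshold']!r}) {{"]
        + emit(node["true"], nodes, depth + 1)
        + [f"{pad}}} else {{"]
        + emit(node["false"], nodes, depth + 1)
        + [f"{pad}}}"]
    )


def evaluate(x, nodes):
    """Follow the branches for feature vector x and return the leaf scores."""
    node = nodes[find_root(nodes)]
    while "leaf" not in node:
        go_true = x[node["feature"]] > node["threshold"]
        node = nodes[node["true"] if go_true else node["false"]]
    return node["leaf"]


def softmax(scores):
    """Turn raw scores into probabilities (Classification_SoftMax in the dump)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    print("\n".join(emit(find_root(NODES), NODES)))
    print(softmax(evaluate([5.1, 3.5, 1.4, 0.2], NODES)))
```

    The generated lines are meant to be pasted into a function that starts with out set to the base prediction values and applies softmax at the end. A model with more iterations would need one such block per tree, with the leaf values summed.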
    

    There is one downside to this approach: CoreML doesn't support the way CatBoost handles categorical features. So if you want to serialize a model with categorical features, you need to one-hot-encode them yourself before training.
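
    One-hot encoding can be done with any tool; the sketch below uses plain Python to keep it dependency-free. The one_hot helper and the colour values are hypothetical, chosen only for illustration. The category order is returned so that the C/Java side can apply exactly the same encoding at prediction time.

```python
def one_hot(values, categories=None):
    """One-hot-encode a list of categorical values.

    Returns (categories, rows), where each row has a 1.0 in the position
    of its category and 0.0 elsewhere.
    """
    if categories is None:
        # Fix a deterministic order so the encoding can be reproduced later.
        categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0.0] * len(categories)
        row[index[v]] = 1.0  # raises KeyError for a category not seen before
        rows.append(row)
    return categories, rows


if __name__ == "__main__":
    cats, encoded = one_hot(["red", "green", "red", "blue"])
    print(cats)
    print(encoded)
```

    The encoded columns would then be appended to the numeric features before calling fit, so the exported tree only ever branches on numeric inputs.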