How to export a CatBoost model to text for later parsing into an if-else decision tree?

I'm currently using the new CatBoost algorithm (Python version) and trying to export my model to a text file so that I can port it to a C/Java implementation. Looking through the documentation, I have only found the save_model method, which accepts only two file formats: 1. binary 2. CoreML (for Apple)

Neither of these formats is suitable for me, so is there another way to achieve this?

Solution

There is no way to do this directly: so far, CatBoost cannot save a model in a human-readable text format.

However, CatBoost can already export models to CoreML, and coremltools can dump a CoreML model as JSON-like text. Here is a minimal example:

from sklearn import datasets
iris = datasets.load_iris()

import catboost
# the shortest possible model specification
cls = catboost.CatBoostClassifier(loss_function='MultiClass', iterations=1, depth=1)
cls.fit(iris.data, iris.target)

# save model to CoreML format
cls.save_model(
    "iris.mlmodel",
    format="coreml", 
    export_parameters={
        'prediction_type': 'probability'
    }
)

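# coremltools can load the exported model and dump its specification as text
import coremltools
model = coremltools.models.model.MLModel("iris.mlmodel")
# print explicitly: a bare expression only shows output in a REPL or notebook
print(model.get_spec())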

You will probably need to read the coremltools documentation to fully understand what this code prints, but you can read the output like this: "There is an ensemble of a single tree with 2 leaves. In leaf 0, class 0 dominates; in leaf 1, classes 1 and 2 dominate. Go to leaf 1 if feature 3 is greater than 0.8, otherwise go to leaf 0." Note that the dump omits fields whose value is zero, which is why the first node has no nodeId and the branch node has no falseChildNodeId.

specificationVersion: 1
description {
  input {
    name: "feature_3"
    type {
      doubleType {
      }
    }
  }
  output {
    name: "prediction"
    type {
      multiArrayType {
        shape: 3
        dataType: DOUBLE
      }
    }
  }
  predictedFeatureName: "prediction"
  predictedProbabilitiesName: "prediction"
  metadata {
    shortDescription: "Catboost model"
    versionString: "1.0.0"
    author: "Mr. Catboost Dumper"
  }
}
treeEnsembleRegressor {
  treeEnsemble {
    nodes {
      nodeBehavior: LeafNode
      evaluationInfo {
        evaluationValue: 0.05084745649058943
      }
      evaluationInfo {
        evaluationIndex: 1
        evaluationValue: -0.025423728245294732
      }
      evaluationInfo {
        evaluationIndex: 2
        evaluationValue: -0.025423728245294732
      }
    }
    nodes {
      nodeId: 1
      nodeBehavior: LeafNode
      evaluationInfo {
        evaluationValue: -0.02752293516463098
      }
      evaluationInfo {
        evaluationIndex: 1
        evaluationValue: 0.01376146758231549
      }
      evaluationInfo {
        evaluationIndex: 2
        evaluationValue: 0.013761467582315471
      }
    }
    nodes {
      nodeId: 2
      nodeBehavior: BranchOnValueGreaterThan
      branchFeatureIndex: 3
      branchFeatureValue: 0.800000011920929
      trueChildNodeId: 1
    }
    numPredictionDimensions: 3
    basePredictionValue: 0.0
    basePredictionValue: 0.0
    basePredictionValue: 0.0
  }
  postEvaluationTransform: Classification_SoftMax
}
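
To get from this dump to an if-else tree, a minimal sketch could look like the following. It assumes the three nodes have been transcribed by hand into a plain dict (the `NODES` layout and the helper names `find_root`, `predict_proba` and `emit_if_else` are made up for illustration and are not part of CatBoost or coremltools), and it handles a single tree only. For a real model you would fill the dict by iterating over the nodes of the spec and sum the leaf values over all trees.

```python
import math

# Nodes transcribed by hand from the dump above. The text dump omits fields
# that are zero, so a missing nodeId or falseChildNodeId is taken to be 0.
NODES = {
    0: {"leaf": [0.05084745649058943, -0.025423728245294732, -0.025423728245294732]},
    1: {"leaf": [-0.02752293516463098, 0.01376146758231549, 0.013761467582315471]},
    2: {"feature": 3, "threshold": 0.800000011920929, "true_child": 1, "false_child": 0},
}
BASE = [0.0, 0.0, 0.0]  # basePredictionValue, one entry per class


def find_root(nodes):
    """Return the id of the one node that is not a child of any branch node."""
    children = set()
    for node in nodes.values():
        if "leaf" not in node:
            children.update((node["true_child"], node["false_child"]))
    (root,) = set(nodes) - children
    return root


def softmax(scores):
    """Classification_SoftMax: turn raw scores into probabilities."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(features, nodes=NODES, base=BASE):
    """Walk from the root to a leaf (BranchOnValueGreaterThan), then softmax."""
    node = nodes[find_root(nodes)]
    while "leaf" not in node:
        go_true = features[node["feature"]] > node["threshold"]
        node = nodes[node["true_child"] if go_true else node["false_child"]]
    return softmax([b + v for b, v in zip(base, node["leaf"])])


def emit_if_else(nodes, node_id=None, indent=0):
    """Render the tree as C-like if-else text (raw scores, before softmax)."""
    if node_id is None:
        node_id = find_root(nodes)
    node, pad = nodes[node_id], "    " * indent
    if "leaf" in node:
        return "".join(f"{pad}score[{i}] += {v!r};\n" for i, v in enumerate(node["leaf"]))
    return (
        f"{pad}if (x[{node['feature']}] > {node['threshold']!r}) {{\n"
        + emit_if_else(nodes, node["true_child"], indent + 1)
        + f"{pad}}} else {{\n"
        + emit_if_else(nodes, node["false_child"], indent + 1)
        + f"{pad}}}\n"
    )


# Petal width 0.2 is not greater than 0.8, so this sample lands in leaf 0.
print(predict_proba([5.1, 3.5, 1.4, 0.2]))
print(emit_if_else(NODES))
```

The emitted text accumulates raw scores into a `score` array, so the C/Java side still has to apply softmax (or simply take the largest score) to obtain the predicted class.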

There is one downside to this approach: CoreML doesn't support the way CatBoost handles categorical features. So if you want to serialize a model with categorical features, you need to one-hot-encode them before training.
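
As a rough illustration of that preprocessing step, here is a dependency-free sketch of one-hot encoding a single column. The `one_hot` helper and the toy rows are invented for this example; in practice you would more likely use `pandas.get_dummies` or scikit-learn's `OneHotEncoder`.

```python
def one_hot(rows, column):
    """Replace one categorical column by a 0/1 column per distinct value."""
    # Sort the categories so the column order is reproducible.
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        rest = row[:column] + row[column + 1:]
        flags = [1.0 if row[column] == c else 0.0 for c in categories]
        encoded.append(rest + flags)
    return encoded, categories


# Toy data: one numeric feature and one categorical feature.
rows = [[5.1, "red"], [6.3, "blue"], [4.9, "red"]]
encoded, categories = one_hot(rows, column=1)
print(categories)
print(encoded)
```

The same category order has to be reused when encoding inputs at prediction time, otherwise the feature indices in the exported tree will not match.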