
How to produce XGBoost outputs without an XGBoost library?


I have an XGBoost binary classifier model trained in Python.

I would like to produce outputs from this model for new input data in a different scripting environment (MQL4), using pure mathematical operations and without the XGBoost library (i.e., without calling .predict).

Can anyone help with the formula and/or algorithm?


Solution

  • After some reverse engineering, I found out how. Once the model is trained, dump it into a text file first:

    import xgboost as xgb

    num_round = 3
    bst = xgb.train(param, dtrain, num_round, watchlist)
    # write each tree out as human-readable text
    bst.dump_model('D:/Python/classifyproduct.raw.txt')
    

    Then, for each booster, walk the tree with the input feature set until you reach a leaf and record its value. Note that the leaf values are raw scores on the log-odds scale, not probabilities. Sum the leaf values across all boosters and, for the binary logistic objective, feed the sum into the logistic function:

    1/(1+exp(-sum))
    
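    In Python that transform is just the sigmoid. A minimal sketch follows; note it reproduces XGBoost's output for the binary:logistic objective with the default base_score of 0.5, whose log-odds contribution is zero (a different base_score would add a constant to the sum):

    import math

    def sigmoid(margin):
        # the summed leaf values form a raw log-odds score ("margin");
        # the sigmoid maps that margin to a probability in (0, 1)
        return 1.0 / (1.0 + math.exp(-margin))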

    The result is the output probability of the trained XGBoost model for the given input feature set. As an example, here is my sample dump for a model with 2 input features (a and b):

    booster[0]:
    0:[b<-1] yes=1,no=2,missing=1
    1:[a<0] yes=3,no=4,missing=3
        3:[a<-2] yes=7,no=8,missing=7
            7:leaf=0.522581
            8:[b<-3] yes=13,no=14,missing=13
                13:leaf=0.428571
                14:leaf=-0.333333
        4:leaf=-0.54
    2:[a<2] yes=5,no=6,missing=5
        5:[a<-8] yes=9,no=10,missing=9
            9:leaf=-0.12
            10:leaf=-0.56129
        6:[b<2] yes=11,no=12,missing=11
            11:leaf=-0.495652
            12:[a<4] yes=15,no=16,missing=15
                15:[b<7] yes=17,no=18,missing=17
                    17:leaf=-0.333333
                    18:leaf=0.333333
                16:leaf=0.456
    booster[1]:
    0:[b<-1] yes=1,no=2,missing=1
    1:[a<0] yes=3,no=4,missing=3
        3:[b<-3] yes=7,no=8,missing=7
            7:leaf=0.418665
            8:[a<-3] yes=13,no=14,missing=13
                13:leaf=0.334676
                14:leaf=-0.282568
        4:leaf=-0.424174
    2:[a<2] yes=5,no=6,missing=5
        5:[b<0] yes=9,no=10,missing=9
            9:leaf=-0.048659
            10:leaf=-0.445149
        6:[b<2] yes=11,no=12,missing=11
            11:leaf=-0.394495
            12:[a<5] yes=15,no=16,missing=15
                15:[b<7] yes=17,no=18,missing=17
                    17:leaf=-0.330064
                    18:leaf=0.333063
                16:leaf=0.392826
    booster[2]:
    0:[b<-1] yes=1,no=2,missing=1
    1:[a<0] yes=3,no=4,missing=3
        3:[b<-3] yes=7,no=8,missing=7
            7:leaf=0.356906
            8:[a<-3] yes=13,no=14,missing=13
                13:leaf=0.289085
                14:leaf=-0.245992
        4:leaf=-0.363819
    2:[a<4] yes=5,no=6,missing=5
        5:[a<2] yes=9,no=10,missing=9
            9:[b<0] yes=15,no=16,missing=15
                15:leaf=-0.0403689
                16:leaf=-0.381402
            10:[b<7] yes=17,no=18,missing=17
                17:leaf=-0.307704
                18:leaf=0.239974
        6:[b<2] yes=11,no=12,missing=11
            11:leaf=-0.308265
            12:leaf=0.302142
    
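    For reference, here is a minimal Python sketch of doing this by hand: parse the text dump into per-tree node tables, walk each tree to a leaf, sum the leaves, and apply the sigmoid. (parse_dump, score_tree, and predict are illustrative names, not part of the XGBoost API; the regexes assume the plain-text dump format shown above.)

    import math
    import re

    NODE = re.compile(r'^\s*(\d+):\[(\w+)<([-+0-9.eE]+)\] yes=(\d+),no=(\d+),missing=(\d+)')
    LEAF = re.compile(r'^\s*(\d+):leaf=([-+0-9.eE]+)')

    def parse_dump(path):
        # one dict per booster, mapping node id -> split or leaf record
        trees = []
        for line in open(path):
            if line.lstrip().startswith('booster['):
                trees.append({})
                continue
            m = NODE.match(line)
            if m:
                nid, feat, thr, yes, no, miss = m.groups()
                trees[-1][int(nid)] = {'feat': feat, 'thr': float(thr),
                                       'yes': int(yes), 'no': int(no),
                                       'missing': int(miss)}
                continue
            m = LEAF.match(line)
            if m:
                trees[-1][int(m.group(1))] = {'leaf': float(m.group(2))}
        return trees

    def score_tree(tree, x):
        # start at the root; take 'yes' when feature < threshold,
        # 'no' otherwise, and 'missing' when the feature is absent
        nid = 0
        while 'leaf' not in tree[nid]:
            node = tree[nid]
            val = x.get(node['feat'])
            if val is None:
                nid = node['missing']
            elif val < node['thr']:
                nid = node['yes']
            else:
                nid = node['no']
        return tree[nid]['leaf']

    def predict(trees, x):
        # sum the leaf scores across boosters, then apply the logistic link
        margin = sum(score_tree(t, x) for t in trees)
        return 1.0 / (1.0 + math.exp(-margin))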

    I have 2 features as inputs. Say the input is [a, b] = [4, 9]. Walking each booster yields the leaf values:

    booster0 : 0.456
    booster1 : 0.333063
    booster2 : 0.302142
    sum = 1.091205
    1/(1+exp(-sum)) = 0.748608563
    
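    The same number drops out of the sketch above (the feature names a and b and the dump path are taken from this example):

    trees = parse_dump('D:/Python/classifyproduct.raw.txt')
    print(predict(trees, {'a': 4, 'b': 9}))  # 0.748608...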

    And that's it.