
How to get predicted probabilities of being in each class of three using xgboost in R?


I am trying to train an xgboost model on the iris dataset. The training code is shown below, and both predict() calls produce the same result. However, the result has length 135, while test_data has only 45 rows. The values look like predicted probabilities, but the label has 3 classes and the result is a single vector rather than a matrix with one column of probabilities per class. How can I get the predicted probability for each class, and also the predicted class?

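library(xgboost)

data("iris")
# xgboost expects class labels as integers starting at 0
iris$Species <- as.numeric(as.factor(iris$Species)) - 1

indexes <- caret::createDataPartition(iris$Species, p = .7, list = FALSE)
train_data <- iris[indexes, ]
test_data <- iris[-indexes, ]

# Keep the label column out of the feature matrix, otherwise the model sees the answer
features <- setdiff(names(iris), "Species")
xgb.train <- xgb.DMatrix(data = as.matrix(train_data[, features]), label = train_data$Species)
xgb.test <- xgb.DMatrix(data = as.matrix(test_data[, features]), label = test_data$Species)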

params = list("objective" = "multi:softprob", 
              "eval_metric" = "mlogloss",
              "num_class" = 3)

xgb.model <- xgboost::xgb.train(params = params, data = xgb.train, nrounds = 1000)
predict(xgb.model, newdata = xgb.test)
predict(xgb.model, newdata = xgb.test, type = "prob")

0.985415220 0.008038994 0.006545801 0.985415220 0.008038994 0.006545801 0.985415220 0.008038994 0.006545801 0.985415220 0.008038994 0.006545801 0.985415220 0.008038994 0.006545801 0.985415220 0.008038994 0.006545801 0.985415220 0.008038994 0.006545801 0.985415220 0.008038994 0.006545801 0.985415220 0.008038994 0.006545801 0.985415220 0.008038994 0.006545801 0.985415220 0.008038994 0.006545801 0.985415220 0.008038994 0.006545801 0.985415220 0.008038994 0.006545801 0.985415220 0.008038994 0.006545801 0.985415220 0.008038994 0.006545801 0.985415220 0.008038994 0.006545801 0.985415220 0.008038994 0.006545801 0.985415220 0.008038994 0.006545801 0.977108896 0.016400522 0.006490625 0.985415220 0.008038994 0.006545801 0.008124468 0.983585954 0.008289632 0.005110676 0.989674747 0.005214573 0.003452316 0.993025184 0.003522499 0.005499140 0.988889933 0.005610934 0.011182932 0.977406859 0.011410273 0.005110676 0.989674747 0.005214573 0.011182932 0.977406859 0.011410273 0.011182932 0.977406859 0.011410273 0.003452316 0.993025184 0.003522499 0.010401487 0.978985548 0.010612942 0.005250969 0.005771303 0.988977730 0.005250969 0.005771303 0.988977730 0.005250969 0.005771303 0.988977730 0.005239322 0.007976402 0.986784279 0.005239322 0.007976402 0.986784279 0.005239322 0.007976402 0.986784279 0.005250969 0.005771303 0.988977730 0.005219116 0.011802264 0.982978642 0.005250969 0.005771303 0.988977730 0.005219116 0.011802264 0.982978642 0.005219116 0.011802264 0.982978642 0.005250969 0.005771303 0.988977730 0.005250969 0.005771303 0.988977730 0.005250969 0.005771303 0.988977730 0.005180326 0.019146746 0.975672841


Solution

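  • predict() returns the probabilities concatenated into a single vector, with num_class values per observation: the three class probabilities for the first test row, then the three for the second row, and so on (45 rows x 3 classes = 135 values). Reshape the vector into a matrix with one row per observation and one column per class. Set byrow=TRUE, because the vector is ordered observation by observation rather than class by class.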

    pred <- predict(xgb.model, newdata = xgb.test)
    
    pred <- matrix(pred, ncol=xgb.model$params$num_class, byrow=TRUE)
    head(pred)
    
              [,1]        [,2]        [,3]
    [1,] 0.9858927 0.007713272 0.006394033
    [2,] 0.9858927 0.007713272 0.006394033
    [3,] 0.9858927 0.007713272 0.006394033
    [4,] 0.9858927 0.007713272 0.006394033
    [5,] 0.9858927 0.007713272 0.006394033
    [6,] 0.9790474 0.014603053 0.006349638
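
The question also asks for the predicted class, which the reshape above does not give directly. A minimal sketch in base R, assuming the column order matches the 0-based labels (column 1 is class 0, and so on). The variable names pred_flat, pred_mat and pred_class are made up for illustration, and the nine values are three observations copied from the output in the question so that the snippet runs without a trained model:

```r
# Flat vector in the layout described above: num_class consecutive values
# per observation (three observations copied from the question's output).
num_class <- 3
pred_flat <- c(
  0.985415220, 0.008038994, 0.006545801,
  0.008124468, 0.983585954, 0.008289632,
  0.005180326, 0.019146746, 0.975672841
)

# One row per observation, one column per class.
pred_mat <- matrix(pred_flat, ncol = num_class, byrow = TRUE)
colnames(pred_mat) <- paste0("class_", 0:(num_class - 1))

# Index of the largest probability in each row; subtract 1 because the
# labels were recoded to start at 0 (assumption: column order == label order).
pred_class <- max.col(pred_mat, ties.method = "first") - 1
pred_class
# [1] 0 1 2
```

With the real model, pred_mat would be the pred matrix from the answer, and something like mean(pred_class == test_data$Species) would give the test accuracy. If only the class is needed and not the probabilities, the "multi:softmax" objective returns class labels directly.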