Tags: r, machine-learning, boosting

Different accuracy values when fitting a boosted tree twice


I use the R package adabag to fit boosted trees to a (large) data set (140 observations with 3845 predictors).

I ran this method twice with the same parameters and the same data set, and each time a different accuracy value was returned (I defined a simple function that computes the accuracy for a given data set). Did I make a mistake, or is it usual for each fit to return a different accuracy? Is this caused by the data set being large?

The function that returns the accuracy, given the predicted values and the true test-set values:

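    err <- function(pred_d, test_d) {
      # absolute accuracy: number of correct predictions
      abs.acc <- sum(pred_d == test_d)
      # relative accuracy: share of correct predictions
      rel.acc <- abs.acc / length(test_d)

      # return both values as a vector of length 2
      c(abs.acc, rel.acc)
    }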

New edit (9.1.2017): an important follow-up question on the above.

As far as I can see, I do not use any "pseudo-random objects" (such as generated random numbers) in my code, because I essentially just fit trees (using the R package rpart) and boosted trees (using the R package adabag) to a large data set. Can you explain where pseudo-randomness enters when I execute my code?

Edit 1: A similar phenomenon also happens with a single tree (using the R package rpart).

Edit 2: A similar phenomenon did not happen with trees (using rpart) on the iris data set.


Solution

  • There's no reason you should expect to get the same results if you didn't set your seed (with set.seed()).

    It doesn't matter what seed you set if you're doing statistics rather than information security. You might run your model with several different seeds to check its sensitivity. You just have to set it before anything involving pseudo randomness. Most people set it at the beginning of their code.

    This is ubiquitous in statistics; it affects any model or procedure that draws pseudo-random numbers, in any language.

    Note that in information security it's important to have a (pseudo-)random seed that cannot easily be guessed by brute-force attacks, because (in a nutshell) knowing the seed a security program uses internally paves the way for it to be hacked. In science and statistics it's the opposite: you and anyone you share your code or research with should know the seed, to ensure reproducibility.

    https://en.wikipedia.org/wiki/Random_seed

    http://www.grasshopper3d.com/forum/topics/what-are-random-seed-values
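
To make the advice above concrete, here is a minimal sketch in base R. It does not reproduce the asker's pipeline: the helper `random_test_rows()` and the labels `y` are hypothetical stand-ins for any step that draws from R's random number generator, such as a random train/test split. As far as I know, adabag's `boosting()` also draws a bootstrap sample in each iteration by default (`boos = TRUE`), which would be one place where randomness enters even if you never call `sample()` yourself, but treat that as an assumption to check against the package documentation.

```r
# Hypothetical helper: draws a random test set of n_test rows out of n.
# It stands in for any step that uses R's random number generator.
random_test_rows <- function(n, n_test) {
  sort(sample(n, n_test))
}

n <- 140                       # number of observations, as in the question
y <- rep(c(0, 1), times = 70)  # made-up class labels, for illustration only

# Without a fixed seed, two runs will generally produce different splits.
split_a <- random_test_rows(n, 40)
split_b <- random_test_rows(n, 40)

# Setting the same seed before each run makes the runs identical.
set.seed(123)
split_c <- random_test_rows(n, 40)
set.seed(123)
split_d <- random_test_rows(n, 40)
same_seed_same_split <- identical(split_c, split_d)

# Sensitivity check: repeat the procedure under several seeds and compare
# a summary statistic (here, the share of class 1 in the test set).
seeds <- c(1, 2, 3)
share_class1 <- sapply(seeds, function(s) {
  set.seed(s)
  mean(y[random_test_rows(n, 40)])
})
print(share_class1)
```

In a real analysis you would call `set.seed()` once at the top of the script, before the split and before fitting, and replace the summary statistic with the accuracy returned by your own function. If the accuracy varies a lot across seeds, that is a sign the result is sensitive to the random steps rather than a sign of a coding mistake.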