Does anyone know what mechanism the R randomForest package uses to resolve classification ties, i.e. when the trees end up with equal votes in two or more classes?
The documentation says that the tie is broken randomly. However, when you train a model on a set of data and then score the same tied validation vector many times, the class decisions don't split 50/50.
# Score the same tied feature vector 1000 times and record each predicted class
cnum <- vector("integer", 1000)
for (i in 1:length(cnum)) {
  cnum[i] <- as.integer(predict(model, val_x[bad_ind[[1]], ]))
}
# Tally how often each class was chosen
cls <- unique(cnum)
for (i in 1:length(cls)) {
  print(length(which(cnum == cls[i])))
}
where model is the randomForest object and bad_ind is just a list of indices for feature vectors that have tied class votes. In my test cases, using the code above, the distribution between two tied classes is closer to 90/10.
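For what it's worth, you can confirm that a given row really is tied by pulling the raw vote counts directly; a minimal sketch, using the same model, val_x, and bad_ind objects as above:

# Raw, unnormalized vote counts for the suspect row; an exact tie shows
# two classes with equal counts
votes <- predict(model, val_x[bad_ind[[1]], , drop = FALSE],
                 type = "vote", norm.votes = FALSE)
print(votes)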
Also, the recommendation to use an odd number of trees doesn't normally help when a third class pulls some of the votes, leaving the other two classes tied: for example, with 99 trees the votes can split 40/40/19, so two classes are still tied despite the odd count.
Shouldn't these cases, where the forest's trees are tied in voting, come out 50/50?
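Until that's resolved, one possible workaround is to take the raw votes and break exact ties yourself with sample(), which does give a uniform split. A rough sketch, using the mm and val objects from the reproduction below:

v <- predict(mm, val, type = "vote", norm.votes = FALSE)
tie_broken <- apply(v, 1, function(r) {
  winners <- which(r == max(r))
  # Only randomize among genuinely tied classes; a lone winner is kept as-is
  if (length(winners) > 1) names(winners)[sample(length(winners), 1)]
  else names(winners)
})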
Update: It is difficult to provide an example due to the random nature of training a forest, but the following code (sorry for the slop) should produce examples where the forest can't determine a clear winner. My test runs show a 66%/33% distribution when the ties are broken; I expected 50%/50%.
library(randomForest)

# Two uninformative features: training set of 200 points, validation set of 1000
x1 <- runif(200, -4, 4)
x2 <- runif(200, -4, 4)
x3 <- runif(1000, -4, 4)
x4 <- runif(1000, -4, 4)
y1 <- dnorm(x1, mean = 0, sd = 1)
y2 <- dnorm(x2, mean = 0, sd = 1)
y3 <- dnorm(x3, mean = 0, sd = 1)
y4 <- dnorm(x4, mean = 0, sd = 1)
train <- data.frame(v1 = y1, v2 = y2)
val <- data.frame(v1 = y3, v2 = y4)

# Random class labels, so the trees have nothing real to learn and ties are likely
tlab <- vector("integer", length(y1))
tlab_ind <- sample(1:length(y1), length(y1) / 2)
tlab[tlab_ind] <- 1
tlab[-tlab_ind] <- 2
tlabf <- factor(tlab)
vlab <- vector("integer", length(y3))
vlab_ind <- sample(1:length(y3), length(y3) / 2)
vlab[vlab_ind] <- 1
vlab[-vlab_ind] <- 2
vlabf <- factor(vlab)
# Train on the noise, then predict the same validation set several times
mm <- randomForest(x = train, y = tlabf, ntree = 100)
out1 <- predict(mm, val)
out2 <- predict(mm, val)
out3 <- predict(mm, val)
outv1 <- predict(mm, val, norm.votes = FALSE, type = "vote")
outv2 <- predict(mm, val, norm.votes = FALSE, type = "vote")
outv3 <- predict(mm, val, norm.votes = FALSE, type = "vote")

# Any nonzero value here means two prediction passes disagreed on some row
(max(as.integer(out1) - as.integer(out2))); (min(as.integer(out1) - as.integer(out2)))
(max(as.integer(out2) - as.integer(out3))); (min(as.integer(out2) - as.integer(out3)))
(max(as.integer(out1) - as.integer(out3))); (min(as.integer(out1) - as.integer(out3)))
# Collect indices of rows where two passes disagree; these are the tied rows
bad_ind <- vector("list", 0)
for (i in 1:length(out1)) {
  if (out1[[i]] != out2[[i]]) {
    print(paste(i, out1[[i]], out2[[i]], sep = "; "))
    bad_ind <- append(bad_ind, i)
  }
}
# Re-predict each tied row 1000 times and report the observed class split
for (j in 1:length(bad_ind)) {
  cnum <- vector("integer", 1000)
  for (i in 1:length(cnum)) {
    cnum[[i]] <- as.integer(predict(mm, val[bad_ind[[j]], ]))
  }
  cls <- unique(cnum)
  perc_vals <- vector("integer", length(cls))
  for (i in 1:length(cls)) {
    perc_vals[[i]] <- length(which(cnum == cls[i]))
  }
  cat("for feature vector ", bad_ind[[j]], " the class distribution is: ",
      perc_vals[[1]] / sum(perc_vals), "/", perc_vals[[2]] / sum(perc_vals), "\n")
}
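For anyone re-running this, the same tally can be done more compactly; tally_ties here is just a hypothetical helper, not anything from the package:

# Re-predict one tied row n times and tabulate how often each class wins
tally_ties <- function(model, row, n = 1000) {
  table(replicate(n, as.integer(predict(model, row))))
}
tally_ties(mm, val[bad_ind[[1]], ])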
Update: This should be fixed in version 4.6-3 of randomForest.
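You can check whether your installed copy includes the fix with:

packageVersion("randomForest")  # 4.6-3 or later has the fix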