I'm trying to work around the randomForest package limit of 32 levels for factors.
I have a data set with 100 levels in one of the factor variables.
I wrote the following code to see what sampling with replacement across repeated draws would look like, and how many draws it would take to cover a given percentage of the levels.
sampAll <- c()
nums1 <- seq(1, 102, 1)
for (i in 1:20) {
  samp1 <- sample(nums1, 32)             # draw 32 levels (no repeats within a single draw)
  sampAll <- unique(c(sampAll, samp1))   # distinct levels seen across all draws so far
  outSamp1 <- nums1[-sampAll]            # levels never yet sampled
  print(paste(i, " | Remaining: ", length(outSamp1) / 102, sep = ""))
  flush.console()
}
[1] "1 | Remaining: 0.686274509803922"
[1] "2 | Remaining: 0.490196078431373"
[1] "3 | Remaining: 0.333333333333333"
[1] "4 | Remaining: 0.254901960784314"
[1] "5 | Remaining: 0.215686274509804"
[1] "6 | Remaining: 0.147058823529412"
[1] "7 | Remaining: 0.117647058823529"
[1] "8 | Remaining: 0.0980392156862745"
[1] "9 | Remaining: 0.0784313725490196"
[1] "10 | Remaining: 0.0784313725490196"
[1] "11 | Remaining: 0.0490196078431373"
[1] "12 | Remaining: 0.0294117647058824"
[1] "13 | Remaining: 0.0196078431372549"
[1] "14 | Remaining: 0.00980392156862745"
[1] "15 | Remaining: 0.00980392156862745"
[1] "16 | Remaining: 0.00980392156862745"
[1] "17 | Remaining: 0.00980392156862745"
[1] "18 | Remaining: 0"
[1] "19 | Remaining: 0"
[1] "20 | Remaining: 0"
What I'm debating is whether to sample with or without replacement. I'm curious whether anyone has tried something like this, whether I'm breaking any rules (introducing bias, etc.), and whether anyone has suggestions.
NOTE: I've cross-posted this question on Cross Validated as well.
I can recommend two approaches:
First, you can transform your 100-level variable into 100 binary variables, each representing one original level (0 = false, 1 = true). You can then work with the whole dataset and fit a random forest model (see the sketch below). In this case, however, the memory consumed by your dataset will increase, and you may need additional packages for working with large datasets.
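If it helps, here's a minimal sketch of this first approach using base R's model.matrix; df and f are hypothetical names for your data frame and the 100-level factor:

# df and f are placeholder names; model.matrix() expands the factor into
# one 0/1 indicator column per level ("- 1" drops the intercept so every level gets a column)
dummies <- model.matrix(~ f - 1, data = df)
df_expanded <- cbind(df[, setdiff(names(df), "f")], dummies)
# df_expanded no longer contains a >32-level factor and can be passed to randomForest()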
The second possibility is to make many samples of your original dataset with replacement, because if you split the dataset without replacement you will introduce bias into the model (a sketch follows below). Nevertheless, I think you will need far more than 10-15 splits to avoid bias; I cannot say exactly how many, but perhaps several hundred or more, depending on your dataset. If the number of objects in each of the 100 levels differs significantly, then after splitting you will get samples of significantly different sizes, which can hurt the predictive ability of the model. In that case the number of splits should be increased.
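A rough sketch of this second approach, assuming a data frame df with the 100-level factor f and a response y (all names and the number of splits are illustrative, not tuned):

library(randomForest)

fits <- list()
n_splits <- 200                           # deliberately large, for the reasons above
for (i in seq_len(n_splits)) {
  lev <- sample(levels(df$f), 32)         # levels can recur across iterations, so splits overlap
  sub <- droplevels(df[df$f %in% lev, ])  # droplevels() keeps the factor at <= 32 levels
  fits[[i]] <- randomForest(y ~ ., data = sub)
}
# Predictions would then be combined across fits, e.g. by averaging class votes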