I'm debugging some code that uses the randomForest package, with barely any previous R experience.
I've reached a point where, when executing predict.randomForest, I get the error:
New factor levels not present in the training data.
Searching this site, I found the reason and understood that I need to delete the records that are causing the problem.
How can I isolate (find out) which columns/rows are causing the problems?
Assume you have train.data, which you used to build your model, test.data, which you now want to get predictions for, and your factor variable factor.var1, then you could do:
levels(test.data$factor.var1) %in% levels(train.data$factor.var1)
This produces a logical vector with one entry per factor level in test.data; the FALSE entries mark the levels that were not present in train.data.
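To go from levels to the actual columns and rows, the same comparison can be run over every factor column. The following is only a sketch on made-up toy data: the contents of train.data and test.data, the second column factor.var2, and the helper names factor.cols, new.levels and bad.rows are all assumptions for illustration, not part of the original code.

```r
# Toy data (assumed); replace with the real train.data and test.data.
train.data <- data.frame(
  factor.var1 = factor(c("a", "b", "a")),
  factor.var2 = factor(c("x", "y", "x"))
)
test.data <- data.frame(
  factor.var1 = factor(c("a", "c", "b")),
  factor.var2 = factor(c("x", "x", "y"))
)

# Names of the factor columns in the test set.
factor.cols <- names(test.data)[sapply(test.data, is.factor)]

# For each factor column, the levels found in test.data but not in train.data.
new.levels <- lapply(factor.cols, function(col) {
  setdiff(levels(test.data[[col]]), levels(train.data[[col]]))
})
names(new.levels) <- factor.cols

# Keep only the columns that actually have unseen levels.
new.levels <- new.levels[lengths(new.levels) > 0]

# Flag the rows of test.data that carry an unseen level in any of those columns.
bad.rows <- rep(FALSE, nrow(test.data))
for (col in names(new.levels)) {
  bad.rows <- bad.rows | (test.data[[col]] %in% new.levels[[col]])
}

new.levels       # offending columns and their unseen levels
which(bad.rows)  # row numbers of the offending records
```

Here names(new.levels) gives the problem columns and which(bad.rows) the problem rows. test.data[!bad.rows, ] would keep only the remaining records. Removing rows does not remove the now-unused levels from the factors, so droplevels() on the result may also be needed before calling predict.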