I'm currently practicing R on Kaggle using the Titanic data set, and I am using the random forest algorithm.
Below is the code:
fit <- randomForest(as.factor(Survived) ~ Pclass + Sex + Age_Bucket + Embarked
+ Age_Bucket + Fare_Bucket + F_Name + Title + FamilySize + FamilyID,
data=train, importance=TRUE, ntree=5000)
I am getting the following error:
Error in randomForest.default(m, y, ...) :
NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning messages:
1: In data.matrix(x) : NAs introduced by coercion
2: In data.matrix(x) : NAs introduced by coercion
3: In data.matrix(x) : NAs introduced by coercion
4: In data.matrix(x) : NAs introduced by coercion
My data looks like this:
$ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
$ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
$ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1...
$ Age_Bucket : chr "20-25" "30-40" "25-30" "30-40" ...
$ Fare_Bucket: chr "<10" "30+" "<10" "30+" ...
$ Title : Factor w/ 11 levels "Col","Dr","Lady",..: 7 8 5 8 7 7 7 4 8 8 ...
$ F_Name : chr "Braund" "Cumings" "Heikkinen" "Futrelle" ...
$ FamilySize : num 2 2 1 2 1 1 1 5 3 2 ...
$ Embarked : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...
$ FamilyID : chr "Small" "Small" "Alone" "Small" ...
If I just type the below, I have no coercion issues, which as far as I can see is the only place where coercion could occur and create NA values.
Can anyone see the problem?
Thank you for your time.
You need to convert your char columns into factors. Factors are stored as integer codes internally, whereas character fields are not. See the following small demonstration:
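library(randomForest)
# y is a random 0/1 outcome; x2 is deliberately kept as a character column
df <- data.frame(y = sample(0:1, 26, replace = TRUE), x1 = runif(26), x2 = letters, stringsAsFactors = FALSE)
df$y <- as.factor(df$y)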
> str(df)
'data.frame': 26 obs. of 3 variables:
$ y : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 2 1 ...
$ x1: num 0.457 0.296 0.517 0.478 0.764 ...
$ x2: chr "a" "b" "c" "d" ...
Now if I run my randomForest
> randomForest(y ~ x1 + x2, data=df)
Error in randomForest.default(m, y, ...) :
NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning message:
In data.matrix(x) : NAs introduced by coercion
I get the same error you did.
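The warning about NAs comes from text being coerced to numbers. As a rough illustration of that coercion (a minimal sketch using base R's `as.numeric` and `as.integer`, not the exact internals of `randomForest`; the values are made up to resemble the `FamilyID` column):

```r
# Text that is not numeric cannot be coerced to a number, so it becomes NA.
chr_col <- c("Small", "Alone", "Small")
chr_as_num <- suppressWarnings(as.numeric(chr_col))  # warning silenced for the demo
print(chr_as_num)

# A factor stores integer codes plus a table of level labels,
# so the same values coerce cleanly to integers.
fct_col <- factor(chr_col)
fct_codes <- as.integer(fct_col)
print(levels(fct_col))
print(fct_codes)
```

The character vector turns into all NA values, while the factor yields one integer code per level.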
Whereas if I convert the char column into a factor:
df$x2 <- as.factor(df$x2)
> randomForest(y ~ x1 + x2, data=df)
randomForest(formula = y ~ x1 + x2, data = df)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 1
OOB estimate of error rate: 61.54%
Confusion matrix:
0 1 class.error
0 0 16 1
1 0 10 0
It runs without the error! (The high OOB error rate is expected here, since y is random.)
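For a data frame like yours with several chr columns, one way to convert them all at once is sketched below. The `toy` data frame is a hypothetical stand-in for `train`, with made-up values and only a few of your column names:

```r
# Hypothetical stand-in for `train`; the values are invented for illustration.
toy <- data.frame(
  Survived   = c(0L, 1L, 1L, 0L),
  Age_Bucket = c("20-25", "30-40", "25-30", "30-40"),
  FamilyID   = c("Small", "Small", "Alone", "Small"),
  FamilySize = c(2, 2, 1, 2),
  stringsAsFactors = FALSE
)

# Find the character columns and convert each one to a factor.
# Numeric and integer columns are left untouched.
is_chr <- vapply(toy, is.character, logical(1))
toy[is_chr] <- lapply(toy[is_chr], factor)

str(toy)
```

The same two lines should work on `train` itself. One caveat: a column with many distinct values, such as F_Name, may exceed the limit randomForest places on the number of levels in a factor predictor, so it might need to be dropped or grouped first.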