Search code examples
rrandom-forestr-caretfeature-selectionrfe

Error in Feature selection with Recursive feature elimination in random forest model


I have some hundred samples and I have already classified them into four different classes (clusters). Now, I'm interested in identifying the best set of genes that classify the samples into different classes.

I want to apply randomforest with recursive feature elimination and detect the genes (features). My data looks like below. Just posting some example data here.

enter image description here

Above data is just an example: My original data is in dataframe df with 100 samples in first column and 4 classes in second column and columns 3 to column 1002 there are in total 1000 genes with expression values.

I'm using the below code but I see there is an error.

library(caret)
library(mlbench)
library(Hmisc)
library(randomForest)

# define the control using a random forest selection function
control <- rfeControl(functions=rfFuncs, method="cv", number=10)

# run the RFE algorithm
results <- rfe(df[,3:1002], df[,2], sizes = df[,1:1002], rfeControl=control)

There is an Error: And I feel that I'm doing wrong somewhere.

Error in summary.connection(connection) : invalid connection

Here I'm giving the dput of the above data.

df <-structure(list(Samples = structure(c(1L, 8L, 9L, 10L, 11L, 12L, 
13L, 14L, 15L, 2L, 3L, 4L, 5L, 6L, 7L), .Label = c("Sample1", 
"Sample10", "Sample11", "Sample12", "Sample13", "Sample14", "Sample15", 
"Sample2", "Sample3", "Sample4", "Sample5", "Sample6", "Sample7", 
"Sample8", "Sample9"), class = "factor"), Class = structure(c(1L, 
2L, 3L, 1L, 2L, 4L, 2L, 1L, 1L, 4L, 1L, 3L, 4L, 1L, 1L), .Label = c("Class1", 
"Class2", "Class3", "Class4"), class = "factor"), Gene1 = c(1.030078784, 
0.944152632, 0.140700452, 0.013432323, 0.265233165, -0.084496727, 
4.835469554, 0.089434913, -0.433436179, 1.462895475, -0.116005356, 
1.007868422, 0.244881864, -1.495666899, 0.364368654), Gene2 = c(1.407236415, 
1.229003431, -0.322221459, -1.361955252, 0.310963955, 0.80115063, 
4.27765356, 0.872413223, -0.568249851, 1.187873069, -0.255284575, 
1.878058722, -0.767371822, -0.859697473, 0.057304769), Gene3 = c(0.200772234, 
-0.048349737, 1.224274924, 0.492396142, 0.500786902, -0.731802706, 
1.853246564, 1.611995455, 0.287088678, 0.509235514, 2.031735375, 
3.074950771, 2.069407179, 0.886158642, 1.736798303), Gene4 = c(1.23309207, 
1.321282889, 2.403301108, 0.748860637, 1.019200751, 1.393254607, 
2.667976275, 1.158136576, 1.89503732, 2.178257717, 0.747697632, 
2.834410716, 0.028594536, -0.411039831, 1.100167946), Gene5 = c(0.883005616, 
0.570786704, 0.72649548, 4.705893892, 0.086345885, 0.502530136, 
2.681497202, 0.640362079, 0.327319762, 2.086767741, 1.853085301, 
1.001799748, 0.126208601, 0.911621722, 0.671191951), Gene6 = c(2.590519025, 
3.076688902, 1.77414005, 1.014363629, 1.134652225, 2.71957962, 
4.696379063, -0.301828123, 1.214261665, 2.413881644, -0.470794827, 
0.520494891, 0.194511306, 0.075331863, 2.315680177), Gene7 = c(0.088929673, 
0.472549468, -0.125630236, -0.069648505, -0.715250242, 0.068554966, 
4.131662998, -0.075265565, -1.234425917, 0.343350342, 0.190414782, 
1.153495806, 0.210317581, -0.475603641, 0.294299351), Gene8 = c(2.112231178, 
2.780100532, 2.423828553, 1.569215682, 1.018119196, 2.583413401, 
6.483053565, 2.215201821, 1.893325529, 2.342058862, 4.001423142, 
4.221704757, 1.978211867, 1.452633851, 2.556589741)), class = "data.frame", row.names = c(NA, 
-15L))

Can anyone please tell me how I can use the above data and apply random forest to know which genes classify the samples into different classes. thanq.


Solution

  • sizes refer to the number of features you would like to try and retain, it should numeric but you provided something weird in df[,1:1002].

    See something like below, where i simulate a dataset and setting the sizes correctly ensures it runs along to choose the optimal number of features (from what you provide):

    library(caret)
    library(mlbench)
    library(Hmisc)
    library(randomForest) 
    
    set.seed(101)
    df = data.frame(samples=paste0("Samples",1:99),
                    Class=paste0("Class",rep(1:3,33)),
                    matrix(rnorm(99*1000),ncol=1000))
    
    colnames(df)[3:ncol(df)]=paste0("Gene",1:1000)
    
    # we create like 100 informative genes for Class1 and Class2
    df[df$Class=="Class1",3:103] = df[df$Class=="Class1",3:103] + rpois(33*100,1.5)
    df[df$Class=="Class2",104:203] = df[df$Class=="Class2",104:203] + rpois(33*100,1.5)
    
    control <- rfeControl(functions=rfFuncs, method="cv", number=2)
    
    # run the RFE algorithm
    results <- rfe(df[,3:1002], df[,2], sizes = c(50,100,200), 
                   rfeControl=control)
    

    From the above, I ask for 50,100 or 200 informative features and I get:

    results
    Recursive feature selection
    
    Outer resampling method: Cross-Validated (2 fold) 
    
    Resampling performance over subset size:
    
     Variables Accuracy  Kappa AccuracySD KappaSD Selected
            50   0.9792 0.9688    0.02946 0.04419         
           100   0.9896 0.9844    0.01473 0.02210         
           200   1.0000 1.0000    0.00000 0.00000        *
          1000   1.0000 1.0000    0.00000 0.00000         
    
    The top 5 variables (out of 200):
       Gene94, Gene198, Gene137, Gene136, Gene158
    
    > results$optsize
    [1] 200