Tags: r, machine-learning, linear-regression, r-caret, lm

Error: inputs must be factors - although I'm using linear regression and all features are numeric


I am trying to predict age using linear regression in R. The predictors are gene expression values, so every column you see here other than age is a gene.

Here is a small subset of the data (taken from the full data set, called age_pred, not from the training split):

structure(list(age = c(47, 39, 37, 8, 42, 45, 49, 43, 39, 48), 
    HNRNPA0 = c(29.73446, 29.92989, 31.95408, 32.08738, 30.9989, 
    31.73896, 30.79453, 31.47219, 31.81943, 30.88048), ABHD2 = c(32.9946265323029, 
    32.7362770559135, 34.331705505806, 33.7107749955508, 33.4347574459267, 
    34.5282535270287, 33.8085246495487, 33.4646375518867, 33.4936237157377, 
    32.3604653643843), CYB5R3 = c(35.58433, 35.56673, 37.35725, 
    35.05798, 35.36807, 36.20249, 34.61598, 36.41034, 37.95884, 
    35.03965), RPRD2 = c(32.80401, 34.05659, 34.20036, 33.90712, 
    33.21673, 33.75369, 33.64168, 34.37718, 32.62894, 32.84124
    ), GRINA = c(35.02339, 34.49548, 35.43786, 35.73121, 34.2059, 
    34.6569, 33.86705, 35.63485, 34.88564, 34.44139), SEC61A1 = c(34.32433, 
    35.17745, 35.93087, 35.91407, 35.04778, 34.98187, 34.6524, 
    36.05048, 35.16417, 33.89892), HSPA5 = c(32.983, 33.15406, 
    35.41871, 35.88919, 34.10364, 34.23049, 33.81859, 35.34636, 
    34.51912, 33.10022), ARF3 = c(33.7404667070002, 32.4284787643714, 
    34.9797780950407, 35.5112520700914, 33.5425535496703, 34.5253494533377, 
    33.8143672021478, 34.1443535341306, 34.8727981424934, 33.7736424939363
    ), LAMC1 = c(33.58156, 34.4972, 36.386, 35.24869, 35.20215, 
    35.89395, 35.654, 36.31492, 34.99312, 35.20289)), row.names = c("EA595454", 
"EA595500", "EA595522", "EA595529", "EA595597", "EA595624", "EA595632", 
"EA595635", "EA595647", "EA595654"), class = "data.frame")

Code:

DEX = createDataPartition(y = age_pred$age, p=0.8, list = FALSE)
age_trn = age_pred[DEX, ]
age_tst = age_pred[-DEX,]
    
ctrlCV = trainControl(method = 'cv', number = 5 , classProbs = FALSE , savePredictions = TRUE, summaryFunction = twoClassSummary )

ageModel <- caret::train(age ~ ., data = age_trn, 
                           method = 'lm',
                           trControl = ctrlCV)

And the error:

Error in sensitivity.default(data[, "pred"], data[, "obs"], lev[1]) : 
  inputs must be factors

According to glimpse(age_pred), all columns in the data are of type dbl. Here are some of them:

$ age      <dbl> 61, 59, 30, 64, 67, 71, 65, 61, 70, 48, 64, 77, 73, 40, 58, 62, 79, 53, 60, 68, 71, 52, 54, 50, 70, 53, 67, 67, 71, 72, 54,…
$ HNRNPA0  <dbl> 29.92989, 31.95408, 32.08738, 30.99890, 31.73896, 30.79453, 31.47219, 31.81943, 30.88048, 31.83250, 32.70315, 32.06897, 30.…
$ ABHD2    <dbl> 32.73628, 34.33171, 33.71077, 33.43476, 34.52825, 33.80852, 33.46464, 33.49362, 32.36047, 34.25793, 34.30586, 33.86784, 32.…
$ CYB5R3   <dbl> 35.56673, 37.35725, 35.05798, 35.36807, 36.20249, 34.61598, 36.41034, 37.95884, 35.03965, 36.54919, 36.39444, 34.95226, 35.…
$ RPRD2    <dbl> 34.05659, 34.20036, 33.90712, 33.21673, 33.75369, 33.64168, 34.37718, 32.62894, 32.84124, 33.39123, 34.20990, 33.00906, 32.…
$ GRINA    <dbl> 34.49548, 35.43786, 35.73121, 34.20590, 34.65690, 33.86705, 35.63485, 34.88564, 34.44139, 35.44804, 35.09964, 34.30946, 34.…
$ SEC61A1  <dbl> 35.17745, 35.93087, 35.91407, 35.04778, 34.98187, 34.65240, 36.05048, 35.16417, 33.89892, 35.25823, 34.81930, 34.82199, 34.…
$ HSPA5    <dbl> 33.15406, 35.41871, 35.88919, 34.10364, 34.23049, 33.81859, 35.34636, 34.51912, 33.10022, 34.17081, 35.64166, 34.21163, 33.…
$ ARF3     <dbl> 32.42848, 34.97978, 35.51125, 33.54255, 34.52535, 33.81437, 34.14435, 34.87280, 33.77364, 34.44382, 34.84120, 33.96720, 32.…
$ LAMC1    <dbl> 34.49720, 36.38600, 35.24869, 35.20215, 35.89395, 35.65400, 36.31492, 34.99312, 35.20289, 35.34522, 35.51326, 35.87105, 35.…
$ MBD3     <dbl> 27.91208, 29.42368, 27.11015, 28.48502, 29.10552, 29.30748, 27.87883, 30.76615, 26.77972, 29.42166, 27.70776, 32.48756, 34.…

I don't understand why it wants the inputs to be factors. It doesn't make sense: linear regression needs numeric values!

What is causing this error? Is my code faulty anywhere?


Solution

  • The problem was with the cross-validation parameters:

    ctrlCV = trainControl(method = 'cv', number = 5 , classProbs = FALSE , savePredictions = TRUE, summaryFunction = twoClassSummary )
    

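    summaryFunction = twoClassSummary caused this error. twoClassSummary is meant for two-class classification: it computes ROC, sensitivity and specificity, and the sensitivity call seen in the error message requires the observed and predicted values to be factors, so it fails on a numeric outcome such as age. For regression, drop the summaryFunction argument or set it to defaultSummary (caret's default), which reports RMSE, R-squared and MAE. Posting this in case anyone faces the same problem in the future.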
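For reference, here is a minimal sketch of the same pipeline with a regression-appropriate summary function. It assumes the caret package is installed. The data frame is simulated stand-in data (hypothetical values, with only three of the gene columns) so the example runs on its own, and the seed is arbitrary. Swap in the real age_pred to use it on the actual data.

```r
library(caret)

set.seed(42)  # arbitrary seed, only for reproducibility of this sketch

# Hypothetical stand-in for age_pred: a numeric outcome plus numeric predictors.
n <- 60
age_pred <- data.frame(
  age     = round(runif(n, 20, 80)),
  HNRNPA0 = rnorm(n, mean = 31,   sd = 1),
  ABHD2   = rnorm(n, mean = 33.5, sd = 0.7),
  CYB5R3  = rnorm(n, mean = 36,   sd = 1)
)

# Same 80/20 split as in the question.
DEX     <- createDataPartition(y = age_pred$age, p = 0.8, list = FALSE)
age_trn <- age_pred[DEX, ]
age_tst <- age_pred[-DEX, ]

# defaultSummary is caret's default; it is spelled out here to contrast with
# twoClassSummary. For a numeric outcome it reports RMSE, Rsquared and MAE.
ctrlCV <- trainControl(method = "cv", number = 5,
                       savePredictions = TRUE,
                       summaryFunction = defaultSummary)

ageModel <- caret::train(age ~ ., data = age_trn,
                         method = "lm",
                         trControl = ctrlCV)

# Cross-validated regression metrics.
print(ageModel$results)

# Hold-out performance on the test split.
tst_perf <- postResample(pred = predict(ageModel, newdata = age_tst),
                         obs  = age_tst$age)
print(tst_perf)
```

Because the simulated predictors are unrelated to age, the reported metrics carry no meaning. The point is only that training completes and the summary contains regression metrics rather than ROC, sensitivity and specificity.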