
Running into an error while predicting with a model built in cv.glmnet on unbalanced test and training data


I fitted a multinomial logistic regression model with cv.glmnet on a training dataset. When I predict on the test data and try to generate a confusion matrix, it throws an error. The classes in both the training and test sets are unbalanced.

Here are the dimensions of the training and test datasets. Both traindata and testdata come from a larger dataset of 60 rows and 1234 columns, which I split randomly into two sets so that I can check the sensitivity and specificity of the classification at the end.

> dim(traindata)
[1]   40 1234
> dim(testdata)
[1]   20 1234

And this is what I tried.

Subtype   = factor(traindata$Subtype) 
CV=cv.glmnet(x=data.matrix(traindata),y=Subtype,standardize=TRUE,alpha=0,nfolds=3,family="multinomial")
response_predict=predict(CV, data.matrix(testdata),type="response")
predicted = as.factor(names(response_predict)[1:3][apply(response_predict[1:3], 1, which.max)])

The last line throws this error:

Error in apply(response_predict[1:3], 1, which.max) : 
  dim(X) must have a positive length

My question is how to proceed with such an unbalanced dataset using cv.glmnet, and how to get rid of the error above. Thank you.


Solution

  • Class imbalance has nothing to do with this error. The problem is that response_predict is a three-dimensional array, not a matrix or a data frame. For this reason, the last line should be

    predicted <- as.factor(colnames(response_predict[, , 1])[1:3][apply(response_predict[, 1:3, 1], 1, which.max)])
    

    That is, since we are dealing with a three-dimensional array, we need three indices. Also, response_predict[1:3] selects just the first three elements of the array as a plain vector rather than three columns, so apply has no dimensions to work with, which is exactly what the error message says. And since response_predict is not a data frame, names would not return its column names.

    But actually all of that can be written, assuming that there are three possible classes, simply as

    predicted <- as.factor(colnames(response_predict)[apply(response_predict, 1, which.max)])
    

    which is much cleaner. You are probably also aware that

    predicted <- as.factor(predict(CV, data.matrix(testdata), type = "class"))
    

    gives the same result as well.
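
To make the indexing issue concrete, here is a minimal base R sketch that does not call glmnet at all. It builds by hand a small array with the shape described above (observations x classes x one lambda value). The probabilities, the class names "A", "B", "C", and the `actual` labels are made up purely for illustration. The final step shows one way to get a confusion matrix from the predictions.

```r
# Hypothetical stand-in for predict(CV, newx, type = "response"):
# a 3-D array of shape (observations, classes, lambda values), here 4 x 3 x 1.
probs <- c(0.7, 0.2, 0.1, 0.3,   # class "A" for obs1..obs4
           0.2, 0.5, 0.3, 0.3,   # class "B"
           0.1, 0.3, 0.6, 0.4)   # class "C"
response_predict <- array(
  probs,
  dim = c(4, 3, 1),
  dimnames = list(paste0("obs", 1:4), c("A", "B", "C"), "1")
)

# Why the original line fails:
first_three <- response_predict[1:3]  # plain vector of three numbers
is.null(dim(first_three))             # TRUE, so apply() has nothing to iterate over
is.null(names(response_predict))      # TRUE, arrays carry dimnames, not names

# Drop the lambda dimension to get an ordinary observations x classes matrix.
prob_matrix <- response_predict[, , 1]

# Pick the most probable class per row; fixing the levels keeps all classes
# present even if one of them is never predicted.
predicted <- factor(
  colnames(prob_matrix)[apply(prob_matrix, 1, which.max)],
  levels = colnames(prob_matrix)
)

# Made-up true labels for the four observations (assumption for illustration).
actual <- factor(c("A", "B", "B", "C"), levels = colnames(prob_matrix))

# Confusion matrix: rows are predictions, columns are true classes.
confusion <- table(Predicted = predicted, Actual = actual)
print(confusion)
```

Setting the factor levels explicitly matters with unbalanced classes: if a rare class never shows up among the predictions, the table still stays square, which makes per-class sensitivity and specificity easier to compute afterwards.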