I have a logistic regression model fitted with cv.glmnet on a training dataset. When I predict on testdata and try to generate a confusion matrix, it throws an error. The classes in both traindata and testdata are unbalanced.
Here are the dimensions of the two datasets. Both traindata and testdata come from one larger dataset of 60 rows and 1234 columns, which I split randomly into two sets so that I can check the sensitivity and specificity of the classification at the end.
> dim(traindata)
[1] 40 1234
> dim(testdata)
[1] 20 1234
And this is what I tried.
Subtype = factor(traindata$Subtype)
CV=cv.glmnet(x=data.matrix(traindata),y=Subtype,standardize=TRUE,alpha=0,nfolds=3,family="multinomial")
response_predict=predict(CV, data.matrix(testdata),type="response")
predicted = as.factor(names(response_predict)[1:3][apply(response_predict[1:3], 1, which.max)])
The last line throws this error:
Error in apply(response_predict[1:3], 1, which.max) :
dim(X) must have a positive length
My question is how to proceed with such an unbalanced dataset using cv.glmnet, and how to get rid of the error above.
Thank you
The class imbalance has nothing to do with this error. The problem is that response_predict is a three-dimensional array, not a matrix and not a data frame. For this reason, the last line should be
predicted <- as.factor(colnames(response_predict[, , 1])[1:3][apply(response_predict[, 1:3, 1], 1, which.max)])
That is, since we are dealing with a three-dimensional array, we need three indices. Also, response_predict[1:3] selects just the first three numbers of the array rather than three of its columns, so the result has no dim attribute for apply to work on. And since response_predict is not a data frame, names was not going to give you its column names.
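To see these points without fitting a model, here is a minimal base R sketch. It builds a mock array with the same shape as the prediction (20 observations, 3 classes, 1 lambda value). The class labels "A", "B", "C" and the random probabilities are made up for illustration and are not the output of an actual glmnet fit.

```r
# Mock stand-in for predict(CV, newx, type = "response"):
# a 3-D array of shape observations x classes x lambda values.
# The labels and probabilities below are hypothetical.
set.seed(1)
raw <- matrix(runif(20 * 3), nrow = 20)
probs <- raw / rowSums(raw)            # each row sums to 1
response_predict <- array(
  probs,
  dim = c(20, 3, 1),
  dimnames = list(NULL, c("A", "B", "C"), "1")
)

dim(response_predict)                  # three dimensions, not two
names(response_predict)                # NULL: an array has no names() like a data frame
first_three <- response_predict[1:3]   # a plain numeric vector of length 3
dim(first_three)                       # NULL, so apply() has nothing to iterate over

# The original call fails for exactly that reason; capture the message.
err <- tryCatch(
  apply(response_predict[1:3], 1, which.max),
  error = function(e) conditionMessage(e)
)

# Indexing with all three dimensions works.
predicted <- as.factor(
  colnames(response_predict[, , 1])[1:3][apply(response_predict[, 1:3, 1], 1, which.max)]
)
```

Running str(response_predict) on the real prediction object is a quick way to check its shape before indexing it.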
But actually, assuming that there are exactly three possible classes, all of that can be written simply as
predicted <- as.factor(colnames(response_predict)[apply(response_predict, 1, which.max)])
which is much cleaner. I guess you are also aware that
predicted <- as.factor(predict(CV, data.matrix(testdata), type = "class"))
gives the same result as well.
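Once you have predicted, the confusion matrix and the per-class sensitivity and specificity can be computed in base R. The following is only a sketch: the label vectors are invented stand-ins for testdata$Subtype and the predictions, and the one-vs-rest definition of sensitivity and specificity is one common convention for more than two classes, not the only one.

```r
# Hypothetical labels standing in for testdata$Subtype and the predictions.
classes   <- c("A", "B", "C")
actual    <- factor(c("A", "A", "B", "B", "B", "C", "C", "A"), levels = classes)
predicted <- factor(c("A", "B", "B", "B", "C", "C", "C", "A"), levels = classes)

# Rows are predictions, columns are the true classes.
# Giving both factors the same levels keeps the table square even if
# a rare class never appears in the predictions.
cm <- table(Predicted = predicted, Actual = actual)

# One-vs-rest counts for each class.
tp <- diag(cm)                 # predicted as the class and truly the class
fn <- colSums(cm) - tp         # truly the class, predicted as something else
fp <- rowSums(cm) - tp         # predicted as the class, truly something else
tn <- sum(cm) - tp - fn - fp   # everything else

sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
```

With unbalanced classes and only 20 test rows, a class may be missing from the predictions entirely, which is why the levels are set explicitly here instead of relying on as.factor.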