Error message when using predict with LARS model on test data


I use a lars model and apply it to a large data set (75 features) with numerical data and factors.

I train the model by

mm <- model.matrix(target~0+.,data=data)
larsMod <- lars(mm,data$target,intercept=FALSE)

which gives a nice in-sample fit. If I apply it to testdata by

mm.test <- model.matrix(target~0+.,data=test.data)
predict(larsMod,mm.test,type="fit",s=length(larsMod$arc.length))

then I get the error message

Error in scale.default(newx, object$meanx, FALSE) : 
  length of 'center' must equal the number of columns of 'x'

I assume that it has to do with the fact that the factor levels differ between the two data sets. However

which(! colnames(mm.test) %in% colnames(mm) )

gives an empty result while

which(! colnames(mm) %in% colnames(mm.test) )

gives 3 indices. Thus 3 factor levels appear in the training set but not in the test set. Why does this cause a problem? How can I solve this?

The code below illustrates this with a toy example. In the test data set, the factor does not have the level "l3".

require(lars)

data.train = data.frame( target = c(0,1,0,1,1,1,1,0,0,0), f1 = rep(c("l1","l2","l1","l2","l3"),2), n1 = rep(c(1,2,3,4,5),2))
test.data = data.frame(f1 = rep(c("l1","l2","l1","l2","l2"),2),n1 = rep(c(7,4,3,4,5),2) )

mm <- model.matrix(target~0+f1+n1,data = data.train)
colnames(mm)
length(colnames(mm))
larsMod <- lars(mm,data.train$target,intercept=FALSE)

mm.test <- model.matrix(~0+f1+n1,data=test.data)
colnames(mm.test)
length( colnames(mm.test) )
which(! colnames(mm.test) %in% colnames(mm) )
which(! colnames(mm) %in% colnames(mm.test) )

predict(larsMod,mm.test,type="fit",s=length(larsMod$arc.length))

Solution

  • I might be very much off here, but in my field predict doesn't work if it can't find a variable it expects. So I tried forcing the model matrix column for the factor level that is missing from the test data (f1l3) to 0.

    Note1: I created a target variable in the test data, because I couldn't get your code to run otherwise.

    set.seed(123)
    test.data$target <- rbinom(nrow(test.data),1,0.2)
    
    
    #proof of concept:
    mm.test <- model.matrix(target~0+f1+n1,data=test.data)
    mm.test1 <- cbind(f1l3=0,mm.test)
    
    predict(larsMod,mm.test1[,colnames(mm)],type="fit",s=length(larsMod$arc.length)) #runs!
    

    Now generalize this to build a 'complete' model matrix whenever factor levels are missing from the test data.

    #missing columns
    mis_col <- setdiff(colnames(mm), colnames(mm.test))
    
    #matrix of missing levels
    mis_mat <- matrix(0,ncol=length(mis_col),nrow=nrow(mm.test))
    colnames(mis_mat) <- mis_col
    
    #bind together and reorder the columns to match the training matrix
    #(a different column order yielded different results in my testing)
    mm.test2 <- cbind(mm.test,mis_mat)[,colnames(mm)]
    predict(larsMod,mm.test2,type="fit",s=length(larsMod$arc.length)) #runs
    

    Note2: I don't know what happens if the problem is the other way around (factor levels present in the test data that were not in the training data).
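
The error message in the question comes from `scale.default(newx, object$meanx, FALSE)`, so the new matrix apparently needs exactly one column per training column. A possible way to get that in both directions is to fix the factor levels from the training data before building either model matrix. The following is only a minimal base R sketch, not tested against lars itself. The name `train.levels` and the extra test level "l4" are made up for illustration, and dropping test rows with unseen levels is an assumption, not something the question or answer prescribes.

```r
# Toy data from the question; "l4" is added to the test set for illustration only
data.train <- data.frame(
  target = c(0, 1, 0, 1, 1, 1, 1, 0, 0, 0),
  f1 = rep(c("l1", "l2", "l1", "l2", "l3"), 2),
  n1 = rep(c(1, 2, 3, 4, 5), 2)
)
test.data <- data.frame(
  f1 = c("l1", "l2", "l1", "l2", "l4"),
  n1 = c(7, 4, 3, 4, 5)
)

# Levels seen during training (hypothetical helper variable)
train.levels <- sort(unique(data.train$f1))
data.train$f1 <- factor(data.train$f1, levels = train.levels)

# Assumption: rows with a level the model never saw are dropped explicitly,
# since there is no training column for them
unseen <- !(test.data$f1 %in% train.levels)
test.seen <- test.data[!unseen, , drop = FALSE]

# Re-encode the test factor with the training levels; unused levels such as
# "l3" still get an (all-zero) dummy column
test.seen$f1 <- factor(test.seen$f1, levels = train.levels)

mm <- model.matrix(target ~ 0 + f1 + n1, data = data.train)
mm.test <- model.matrix(~ 0 + f1 + n1, data = test.seen)

# Same columns in the same order, so no cbind or reordering is needed
identical(colnames(mm.test), colnames(mm))
```

With the columns aligned this way, `mm.test` can be passed to `predict` in the same way as `mm.test2` above. Whether dropping the unseen rows is acceptable depends on the application.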