Errors making predictions on MCA object FactoMineR


I'm trying to get new coordinates from an MCA analysis in R, using MCA from the FactoMineR package.

Where df is a data frame, running

res = MCA(df)
predict.MCA(res, df)

produces an error:

Error in predict.MCA(res, df) : 
  The following categories are not in the active dataset: 0.01.12.13.11.22.21.30.01.12.13.11.22.23.21.32.33.30.01.12.13.11.22.23.21.30.01.12.13.11.22.23.21.32.33.30.01.12.13.11.22.23.21.32.33.30.015.1610.111.215.2610.211.315.3610.3

I'm unsure of how the categories can be different because it's the exact same dataframe (df) in both MCA and predict. (I did this just for debugging because I originally got this error while trying to convert my test set.)

I tried using droplevels for every column of the input dataframe, but I get the same error.

Any help appreciated.


Solution

  • Without a reproducible example, it's difficult to diagnose this problem, but there are a few places to start.

    If you look at where the error is raised inside predict.MCA(), it comes from these lines (edited to be a bit more legible):

        olddata <- object$call$X[,
                                 rownames(object$var$eta),
                                 drop=FALSE]
        newdata <- newdata[,
                           colnames(olddata),
                           drop = FALSE]
        pb <- NULL
        for (i in 1:ncol(newdata)) {
          if (sum(!levels(newdata[ ,i]) %in% levels(olddata[ ,i])) > 0) {
            pb <- c(pb, levels(newdata[, i])[which(!levels(newdata[, i]) %in% levels(olddata[ ,i]))])
          }
        }
        if (!is.null(pb)) {
          stop("The following categories are not in the active dataset: ",pb)
        }
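
    A side note on why the message is so hard to read: stop() pastes its arguments together with no separator, so all the offending levels run into one string. The sketch below uses base R only and does not touch FactoMineR. `find_new_levels` is a hypothetical helper name, and the two toy data frames are made up to mimic the situation. It reports the unmatched levels per column instead of as one string.

    ```r
    # Hypothetical helper (not part of FactoMineR): for each shared column,
    # list the levels of `newdata` that are missing from `olddata`.
    find_new_levels <- function(olddata, newdata) {
      shared <- intersect(colnames(olddata), colnames(newdata))
      missing <- lapply(shared, function(col) {
        setdiff(levels(newdata[[col]]), levels(olddata[[col]]))
      })
      names(missing) <- shared
      # Keep only the columns that actually have unmatched levels
      missing[lengths(missing) > 0]
    }

    # Toy data (assumed): `old` mimics renamed levels, `new` keeps the raw ones
    old <- data.frame(Reading = factor(c("Reading_n", "Reading_y")),
                      TV      = factor(c("0", "1")))
    new <- data.frame(Reading = factor(c("n", "y")),
                      TV      = factor(c("0", "1")))

    unmatched <- find_new_levels(old, new)
    str(unmatched)
    ```

    Running a check like this on your own data before calling predict.MCA() should at least show which columns are responsible.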
    

    Luckily, the error can be reproduced with one of FactoMineR's own examples, the hobbies dataset, and the cause turns out to be a bit interesting:

    library(FactoMineR)
    
    hobbies <- read.table("http://factominer.free.fr/course/doc/data_MCA_Hobbies.csv",
                          header = TRUE,
                          sep = ";")
    hobbies[, "TV"] <- as.factor(hobbies[, "TV"])
    
    res.mca <- MCA(hobbies,
                   quali.sup = 19:22,
                   quanti.sup = 23)
    
    pred.obj <- predict.MCA(object = res.mca,
                            newdata = hobbies)
    

    returns the error:

    Error in predict.MCA(object = res.mca, newdata = hobbies) : 
      The following categories are not in the active dataset: nynynynynynynynynynynynynynynynyny
    

    So what's going on here? If you hack FactoMineR's code a bit so that it returns its 'olddata' and 'newdata' objects, you'll see that the factor levels in 'olddata' have been renamed from "n" and "y" (as in the hobbies dataset) to "columnname_n" and "columnname_y".

    This just involves inserting:

    return(list(olddata, newdata))
    

    on the line right after those two assignments in predict.MCA(), then:

    X <- predict.MCA(object = res.mca,
                     newdata = hobbies)
    

    and then:

    levels(X[[1]][, 1L])
    [1] "Reading_n" "Reading_y"
    levels(X[[2]][, 1L])
    [1] "n" "y"
    

    So FactoMineR is checking that the factor levels in its new data match the factor levels in its old data, and the old levels have been renamed. Not being a FactoMineR user, it's not entirely clear to me why they would want to recode the data frame this way. There might be a reason, but it seems to me that this would actually affect any data that gets fed to this function. You can work around it by either a) converting your new data to follow this format, or b) converting their adjusted format back to its original.

    Option a) is probably the better choice, though I can't say for sure. For the 'hobbies' dataset it would look something like this:

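    # Work on a copy so the original data stays untouched
    new.hobbies <- hobbies
    
    # Prefix every value with its column name, e.g. "n" becomes "Reading_n".
    # Note: this turns every column, numeric ones included, into a factor.
    for (m1 in seq_len(ncol(hobbies))) {
      new.hobbies[, m1] <- as.factor(paste(colnames(new.hobbies)[m1],
                                           as.character(new.hobbies[, m1]),
                                           sep = "_"))
    }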
    

    This will still error out for the hobbies dataset, because MCA apparently doesn't paste column names onto the levels of factors derived from integers, but hopefully it gets you part of the way to what you wanted. If you could provide an actual example of your data, we might be able to help you reach a clearer resolution.
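
    One possible way around the integer-column problem is to decide per column, by comparing against the levels of the old data, instead of prefixing everything. The following is a minimal base R sketch under that assumption. `align_levels` is a hypothetical helper, the toy data frames only stand in for the old and new data, and I have not run it against FactoMineR itself.

    ```r
    # Hypothetical helper (not part of FactoMineR): prefix the levels of a
    # column with "<column>_" only when that makes them match `olddata`.
    align_levels <- function(newdata, olddata) {
      for (col in intersect(colnames(newdata), colnames(olddata))) {
        if (!is.factor(newdata[[col]])) next
        old_lv <- levels(olddata[[col]])
        new_lv <- levels(newdata[[col]])
        if (all(new_lv %in% old_lv)) next        # already compatible
        prefixed <- paste(col, new_lv, sep = "_")
        if (all(prefixed %in% old_lv)) {
          levels(newdata[[col]]) <- prefixed     # rename, keep level order
        }
      }
      newdata
    }

    # Toy stand-ins (assumed): `old` plays the role of the renamed old data
    old <- data.frame(Reading = factor(c("Reading_n", "Reading_y", "Reading_y")),
                      TV      = factor(c("0", "1", "2")))
    new <- data.frame(Reading = factor(c("y", "n")),
                      TV      = factor(c("2", "0")))

    fixed <- align_levels(new, old)
    levels(fixed$Reading)
    levels(fixed$TV)
    ```

    Since the predict code quoted above reads its old data from object$call$X, the real call would presumably be something like align_levels(hobbies, res.mca$call$X), but treat that as a guess to verify rather than a confirmed fix.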