Search code examples
rcpp

How to call model.matrix or equivalent from RCPP, possibly in threaded code?


we were hoping to use threads to get things going faster in an algorithm with many loops whose results are not interdependent.

within the code we hoped to port to rcpp, there is a call to model.matrix.

This did not appear straightforward to port.

Investigating this further (as to what code this runs for our use case), revealed that the S3 method for lm objects does some preparatory work on the variable and then calls the default version of the function as can be seen in this copy-paste of the code:

function (object, ...) 
{
    if (n_match <- match("x", names(object), 0L)) 
        object[[n_match]]
    else {
        data <- model.frame(object, xlev = object$xlevels, ...)
        if (exists(".GenericCallEnv", inherits = FALSE)) 
            NextMethod("model.matrix", data = data, contrasts.arg = object$contrasts)
        else {
            dots <- list(...)
            dots$data <- dots$contrasts.arg <- NULL
            do.call("model.matrix.default", c(list(object = object, 
                data = data, contrasts.arg = object$contrasts), 
                dots))
        }
    }
}

the default version of the function farms at least some of its functionality out to a compiled C function:

function (object, data = environment(object), contrasts.arg = NULL, 
    xlev = NULL, ...) {
    t <- if (missing(data)) 
        terms(object)
    else terms(object, data = data)
    if (is.null(attr(data, "terms"))) 
        data <- model.frame(object, data, xlev = xlev)
    else {
        reorder <- match(vapply(attr(t, "variables"), deparse2, 
            "")[-1L], names(data))
        if (anyNA(reorder)) 
            stop("model frame and formula mismatch in model.matrix()")
        if (!identical(reorder, seq_len(ncol(data)))) 
            data <- data[, reorder, drop = FALSE]
    }
    int <- attr(t, "response")
    if (length(data)) {
        contr.funs <- as.character(getOption("contrasts"))
        namD <- names(data)
        for (i in namD) if (is.character(data[[i]])) 
            data[[i]] <- factor(data[[i]])
        isF <- vapply(data, function(x) is.factor(x) || is.logical(x), 
            NA)
        isF[int] <- FALSE
        isOF <- vapply(data, is.ordered, NA)
        for (nn in namD[isF]) if (is.null(attr(data[[nn]], "contrasts"))) 
            contrasts(data[[nn]]) <- contr.funs[1 + isOF[nn]]
        if (!is.null(contrasts.arg)) {
            if (!is.list(contrasts.arg)) 
                warning("non-list contrasts argument ignored")
            else {
                if (is.null(namC <- names(contrasts.arg))) 
                  stop("'contrasts.arg' argument must be named")
                for (nn in namC) {
                  if (is.na(ni <- match(nn, namD))) 
                    warning(gettextf("variable '%s' is absent, its contrast will be ignored", 
                      nn), domain = NA)
                  else {
                    ca <- contrasts.arg[[nn]]
                    if (is.matrix(ca)) 
                      contrasts(data[[ni]], ncol(ca)) <- ca
                    else contrasts(data[[ni]]) <- contrasts.arg[[nn]]
                  }
                }
            }
        }
    }
    else {
        isF <- FALSE
        data[["x"]] <- raw(nrow(data))
    }
    ans <- .External2(C_modelmatrix, t, data)
    if (any(isF)) 
        attr(ans, "contrasts") <- lapply(data[isF], attr, 
            "contrasts")
    ans
}

is there some way of calling C_modelmatrix from Rcpp at all, whether it is single OR multi-threaded? Is there any library or package that does essentially the same thing from within Rcpp so I don't have to reinvent the wheel here? I'd rather not have to fully re-implement everything that model.matrix does if I can avoid it.

as we don't actually have functioning code, there isn't any to show for this yet.

The relevant portion of the function we were trying to speed up calls model.matrix like this: ("model.y is an lm", data are both copies of an original object returned by model.frame(model.y) )

ymat.t <- model.matrix(terms(model.y), data=pred.data.t)
ymat.c <- model.matrix(terms(model.y), data=pred.data.c)

this isn't really a results based question, more of an approach/methods based question


Solution

  • You can call model.matrix from within C++, but you cannot do so in a multi-threaded way.

    There will also be overhead, but if the function call is needed deep within the middle of your code, it could be worth it as a convenience.

    Example:

    // [[Rcpp::export]]
    RObject call(RObject x, RObject y){
      Environment env = Environment::global_env();
      Function f = env["model.matrix"];
      RObject res = f(x,y);
      return res;
    }