extract variables in formula from a data frame


I have a formula that contains some terms and a data frame (the output of an earlier model.frame() call) that contains all of those terms and some more. I want the subset of the model frame that contains only the variables that appear in the formula.

ff <- log(Reaction) ~ log(1+Days) + x + y
fr <- data.frame(`log(Reaction)`=1:4,
                 `log(1+Days)`=1:4,
                 x=1:4,
                 y=1:4,
                 z=1:4,
                 check.names=FALSE)

The desired result is fr minus the z column (fr[,1:4] is cheating -- I need a programmatic solution ...)

Some strategies that don't work:

fr[all.vars(ff)]
## Error in `[.data.frame`(fr, all.vars(ff)) : undefined columns selected

(because all.vars() returns "Reaction", not "log(Reaction)")

stripwhite <- function(x) gsub("(^ +| +$)","",x)
vars <- stripwhite(unlist(strsplit(as.character(ff)[-1],"\\+")))
fr[vars]
## Error in `[.data.frame`(fr, vars) : undefined columns selected

(because splitting on + spuriously splits the log(1+Days) term).

I've been thinking about walking down the parse tree of the formula:

ff[[3]]       ## log(1 + Days) + x + y
ff[[3]][[1]]  ## `+`
ff[[3]][[2]]  ## log(1 + Days) + x

but I haven't got a solution put together, and it seems like I'm going down a rabbit hole. Ideas?
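
The parse-tree idea above can be sketched as follows. This is a minimal illustration, not a general solution: the helper names `split_plus` and `term_names` are hypothetical, and the sketch assumes a two-sided formula whose right-hand side is built only from binary `+` (no `*`, `:`, `-`, or `|`). It also assumes the column names in fr contain no spaces, so the deparsed terms are stripped of spaces before matching.

```r
ff <- log(Reaction) ~ log(1+Days) + x + y
fr <- data.frame(`log(Reaction)`=1:4,
                 `log(1+Days)`=1:4,
                 x=1:4,
                 y=1:4,
                 z=1:4,
                 check.names=FALSE)

## Hypothetical helper: recurse only through binary `+` calls,
## so log(1 + Days) is kept whole rather than split at its inner `+`
split_plus <- function(e) {
  if (is.call(e) && identical(e[[1]], as.name("+")) && length(e) == 3) {
    c(split_plus(e[[2]]), split_plus(e[[3]]))
  } else {
    list(e)
  }
}

## Hypothetical helper: response plus additive terms, deparsed to strings
term_names <- function(f) {
  parts <- c(list(f[[2]]), split_plus(f[[3]]))
  ## deparse() writes "log(1 + Days)"; drop spaces to match "log(1+Days)"
  gsub(" ", "", vapply(parts,
                       function(p) paste(deparse(p), collapse = ""),
                       character(1)))
}

vars <- term_names(ff)
fr[vars]   ## all columns of fr except z
```

Because the recursion stops at any call that is not `+`, the spurious split from the strsplit() attempt does not occur here.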


Solution

  • This should work:

    > fr[gsub(" ","",rownames(attr(terms.formula(ff), "factors")))]
      log(Reaction) log(1+Days) x y
    1             1           1 1 1
    2             2           2 2 2
    3             3           3 3 3
    4             4           4 4 4
    

    And props to Roman Luštrik for pointing me in the right direction.

    Edit: Looks like you could pull it out of the "variables" attribute as well:

    fr[gsub(" ","",attr(terms(ff),"variables")[-1])]
    

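    Edit 2: Found the first problem case, involving I() or offset(). The lookup below fails because the extracted names keep the I() wrapper, which the column names of fr do not have: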

    ff <- I(log(Reaction)) ~ I(log(1+Days)) + x + y
    fr[gsub(" ","",attr(terms(ff),"variables")[-1])]
    

    Those would be pretty easy to correct with a regex, though. But if, as in the question, a column is literally named log(x) and is used in a formula alongside something like I(log(y)) for a variable y, this will get really messy.
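
    As an alternative to a regex, the I() case above could be handled on the expressions themselves before turning them into strings. The following is only a sketch under assumptions: `unwrap_I` is a hypothetical helper, it removes a single outer I() and nothing else (offset() is not handled), and it assumes the columns of fr are named without the I() wrapper, as in the example above. It does not resolve the ambiguous case described in the last paragraph.

```r
ff <- I(log(Reaction)) ~ I(log(1+Days)) + x + y
fr <- data.frame(`log(Reaction)`=1:4,
                 `log(1+Days)`=1:4,
                 x=1:4,
                 y=1:4,
                 z=1:4,
                 check.names=FALSE)

## Hypothetical helper: strip one outer I(...) call, if present
unwrap_I <- function(e) {
  if (is.call(e) && identical(e[[1]], as.name("I")) && length(e) == 2) {
    e[[2]]
  } else {
    e
  }
}

## The "variables" attribute is a call to list(); drop the leading `list`
vars <- as.list(attr(terms(ff), "variables"))[-1]

## Deparse each unwrapped variable and drop spaces to match the column names
nms <- vapply(vars,
              function(v) gsub(" ", "", paste(deparse(unwrap_I(v)), collapse = "")),
              character(1))
fr[nms]   ## all columns of fr except z
```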