How can I obtain a minimal data frame of only the variables used in a statistical model in R?


Take the following example:

fit <- lm(Sepal.Length ~ log(Sepal.Width), data = iris)

I would like a copy of iris that includes only the variables involved in making fit. I don't think model.matrix() or model.frame() quite do it because of the log: they include log(Sepal.Width) but not Sepal.Width. Basically, I want a minimal version of iris containing only the variables used in making fit. How can I do that? This is of course just an example, and I would like a more general solution (say a fit that uses a number of variables, many of them passed through transformations that are not necessarily invertible).
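
To make the problem concrete, here is a small sketch comparing the column names that model.frame() returns with the raw variable names referenced in the formula. The names mf_names and raw_names are only illustrative.

```r
# Refit the model from the question (iris ships with base R).
fit <- lm(Sepal.Length ~ log(Sepal.Width), data = iris)

# model.frame() keeps the transformed term as a column, not the raw variable.
mf_names <- names(model.frame(fit))
print(mf_names)

# all.vars() lists the untransformed variable names used in the formula.
raw_names <- all.vars(formula(fit))
print(raw_names)
```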


Solution

  • I think what you want is get_all_vars()

    get_all_vars(fit, data = iris)
    

    Output:

    #    Sepal.Length Sepal.Width
    #1            5.1         3.5
    #2            4.9         3.0
    #3            4.7         3.2
    #4            4.6         3.1
    #5            5.0         3.6
    #6            5.4         3.9
    #7            4.6         3.4
    # ...
    

    This returns the untransformed variables (i.e., Sepal.Width instead of log(Sepal.Width)), as seen here:

    all.equal(iris$Sepal.Width, 
              get_all_vars(fit, data = iris)$Sepal.Width)
    
    #[1] TRUE
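
For the more general case raised in the question, the same call should carry over to a model with several transformed terms. The sketch below is only an illustration: fit2, minimal and by_name are made-up names, and the formula is an arbitrary example rather than anything from the original post. It also shows an alternative that subsets the data by the names returned by all.vars(), which assumes every variable in the formula is a column of the data frame.

```r
# A hypothetical model with several transformed predictors and a factor.
fit2 <- lm(Sepal.Length ~ log(Sepal.Width) + I(Petal.Length^2) + Species,
           data = iris)

# get_all_vars() evaluates the raw variables named in the formula.
minimal <- get_all_vars(fit2, data = iris)
print(names(minimal))

# Alternative: subset the original data by the raw variable names.
# Assumes all of these names are columns of iris.
by_name <- iris[all.vars(formula(fit2))]
print(names(by_name))
```

Both approaches keep all rows of iris and drop Petal.Width, the one column not used in the fit.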