Way to extract data from lm-object before function is applied?


Let me dive directly into an example to show my problem:

 rm(list=ls())
 n <- 100
 df <- data.frame(y=rnorm(n), x1=rnorm(n), x2=rnorm(n) )
 fm <- lm(y ~ x1 + poly(x2, 2), data=df)

Now, I would like to have a look at the previously used data. This is almost available by using

 temp.data <- fm$model

However, x2 will have been split up into poly(x2, 2), which will itself be a data frame, as it contains a value for x2 and x2^2. Note that it may seem as if x2 is contained here, but since the polynomial uses orthogonal components, temp.data$x2 is not the same as df$x2. This can also be seen if you compare the variables visually after, say, the following: new.dat <- cbind(df, fm$model).

Now, to some questions:

First, and most importantly, is there a way to retrieve x2 from the lm-object in its original form? Or more generally, if some function f has been applied to some variable in the lm-formula, can the underlying variables be extracted from the lm-object (without doing case-specific math)? Note that I know I could retrieve the data by other means, but I wonder if I can extract it from the lm-object itself.

Second, on a more general note, since I did explicitly not ask for model.matrix(fm), why do I get data that has been manipulated? What is the underlying philosophy behind that? Does anyone know?

Third, the command head(new.dat) shows me that x2 has been split up into two components. What I see when I type View(new.dat) is, however, only one column. This strikes me as puzzling. How can two columns be represented as one, and why is there a difference between head and View? If anyone can explain, I would be highly indebted!

If these questions are too basic, I apologize. In this case, I would appreciate any pointers to relevant manuals where this is explained.

Thanks in advance!


Solution

  • Good question, but this is difficult. fm$model is a weird data frame, of a type that would be hard for a user to construct, but which R sometimes generates internally. Check out the first few lines of str(fm$model), which show you that it's a data frame whose third component is an object of class poly with dimensions (100,2) -- i.e. something like a matrix:

    ## 'data.frame':    100 obs. of  3 variables:
    ##  $ y          : num  -0.5952 -1.9561 1.8467 -0.2782 -0.0278 ...
    ##  $ x1         : num  0.423 -1.539 -0.694 0.254 -0.13 ...
    ##  $ poly(x2, 2): poly [1:100, 1:2] 0.0606 -0.0872 0.0799 -0.1068 -0.0395 ...
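
    The matrix-valued column can also be inspected directly. The following is a minimal sketch that rebuilds the model from the question (the set.seed call and the names mf and p are additions for illustration). It shows that the model frame has three columns, while the third one is itself a two-column matrix. That nesting is presumably why a printed data frame and a viewer can show different numbers of columns.

    ```r
    set.seed(1)  # added for reproducibility; not part of the original question
    n <- 100
    df <- data.frame(y = rnorm(n), x1 = rnorm(n), x2 = rnorm(n))
    fm <- lm(y ~ x1 + poly(x2, 2), data = df)

    mf <- fm$model
    ncol(mf)                   # number of columns of the data frame itself
    p <- mf[["poly(x2, 2)"]]   # extract the third column by its name
    class(p)                   # carries the classes "poly" and "matrix"
    dim(p)                     # n rows, one column per polynomial degree

    # The first basis column is an orthogonalised version of x2, not x2 itself.
    isTRUE(all.equal(as.numeric(p[, 1]), df$x2))

    # Flattening to a plain matrix expands the nested column.
    dim(as.matrix(mf))
    ```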
    

    If you're still working in the environment from which lm was called in the first place, and if lm was called using the data argument, you can use eval(getCall(fm)$data) to get the original data. If things are being passed in and out of functions, or if someone used lm on independent objects in the environment, you're probably out of luck. If you get in trouble you can try

    eval(getCall(fm)$data, environment(formula(fm)))
    

    but things rapidly start getting harder.
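
    Put together, the lookup might be sketched as follows, assuming the model was fitted with a data argument and that the data frame still exists unchanged in the formula's environment. The names orig, vars and raw are introduced here for illustration only.

    ```r
    set.seed(1)  # added for reproducibility; not part of the original question
    n <- 100
    df <- data.frame(y = rnorm(n), x1 = rnorm(n), x2 = rnorm(n))
    fm <- lm(y ~ x1 + poly(x2, 2), data = df)

    # Evaluate the `data` argument of the stored call in the environment
    # attached to the model formula.
    orig <- eval(getCall(fm)$data, environment(formula(fm)))

    # all.vars() lists the untransformed variable names used in the formula.
    vars <- all.vars(formula(fm))
    raw <- orig[, vars]

    identical(raw$x2, df$x2)  # compare the recovered x2 with the original
    ```

    Note that this re-reads the data frame by name rather than extracting anything stored in fm, so it reflects whatever df contains at the time of the lookup, not necessarily what it contained when the model was fitted.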

    I don't fully understand the logic of storing the processed model frame rather than the raw data, but I think it has to do with the construction of the terms object for the linear model -- each element in the stored model frame corresponds to an element of the terms object. I don't really understand the distinction between factors -- which are post-processed by model.matrix into sets of columns of dummy variables -- and transformed data (e.g. log(x)) or special objects like polynomial or spline bases ...
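
    The difference between the two processing stages can at least be observed. The sketch below uses a hypothetical extended model (the factor g, the exp(x1) term and the name fm2 are assumptions added for illustration, not part of the question). In the model frame, the transformation and the polynomial basis are already evaluated while the factor is still a single factor column; only model.matrix expands the factor into dummy columns.

    ```r
    set.seed(1)  # added for reproducibility
    n <- 100
    df <- data.frame(y = rnorm(n), x1 = rnorm(n), x2 = rnorm(n))
    df$g <- factor(rep(c("a", "b", "c"), length.out = n))  # hypothetical factor

    fm2 <- lm(y ~ g + exp(x1) + poly(x2, 2), data = df)

    mf2 <- model.frame(fm2)
    names(mf2)                            # one column per term (plus the response)
    attr(terms(fm2), "dataClasses")       # recorded class of each model-frame column
    is.factor(mf2$g)                      # the factor is stored unexpanded

    colnames(model.matrix(fm2))           # design matrix: factor and basis expanded
    ```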

    To the extent that this is explained anywhere in detail, it's probably in the "White Book" (Chambers, J. M., and T. J. Hastie, eds. 1991. Statistical Models in S 1st ed. Chapman and Hall/CRC), but even there I don't think there's a completely satisfying answer.