Let me dive straight into an example to show my problem:
rm(list=ls())
n <- 100
df <- data.frame(y=rnorm(n), x1=rnorm(n), x2=rnorm(n) )
fm <- lm(y ~ x1 + poly(x2, 2), data=df)
Now, I would like to have a look at the previously used data. This is almost available by using
temp.data <- fm$model
However, x2 will have been split up into poly(x2,2), which will itself be a data frame as it contains a value for x2 and x2^2. Note that it may seem as if x2 is contained here, but since the polynomial uses orthogonal components, temp.data$x2 is not the same as df$x2. This can also be seen if you compare the variables visually after, say, the following: new.dat <- cbind(df, fm$model).
Now, to some questions:
First, and most importantly, is there a way to retrieve x2 from the lm-object in its original form? Or more generally, if some function f has been applied to some variable in the lm-formula, can the underlying variables be extracted from the lm-object (without doing case-specific math)? Note that I know I could retrieve the data by other means, but I wonder if I can extract it from the lm-object itself.
Second, on a more general note, since I explicitly did not ask for model.matrix(fm), why do I get data that has been manipulated? What is the underlying philosophy behind that? Does anyone know?
Third, the command head(new.dat) shows me that x2 has been split up into two components. What I see when I type View(new.dat) is, however, only one column. This strikes me as puzzling and mind-boggling. How can two columns be represented as one, and why is there a difference between head and View? If anyone can explain, I would be highly indebted!
If these questions are too basic, I apologize. In that case, I would appreciate any pointers to relevant manuals where this is explained.
Thanks in advance!
Good question, but this is difficult. fm$model is a weird data frame, of a type that would be hard for a user to construct, but which R sometimes generates internally. Check out the first few lines of str(fm$model), which show you that it's a data frame whose third component is an object of class poly with dimensions (100,2), i.e. something like a matrix:
## 'data.frame': 100 obs. of 3 variables:
## $ y : num -0.5952 -1.9561 1.8467 -0.2782 -0.0278 ...
## $ x1 : num 0.423 -1.539 -0.694 0.254 -0.13 ...
## $ poly(x2, 2): poly [1:100, 1:2] 0.0606 -0.0872 0.0799 -0.1068 -0.0395 ...
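To make the "matrix inside a data frame" point concrete, here is a minimal sketch that rebuilds the question's example and inspects the third component directly. The set.seed() call and the helper names (mf, p, n_top_level, poly_dim, poly_class) are my additions for illustration only.

```r
set.seed(1)  # added only so the sketch is reproducible; not in the question
n  <- 100
df <- data.frame(y = rnorm(n), x1 = rnorm(n), x2 = rnorm(n))
fm <- lm(y ~ x1 + poly(x2, 2), data = df)

mf <- fm$model              # the stored model frame
p  <- mf[["poly(x2, 2)"]]   # third component, extracted by its deparsed name

n_top_level <- ncol(mf)     # the data frame counts the poly term as ONE column
poly_dim    <- dim(p)       # ...but that column is itself a two-column matrix
poly_class  <- class(p)

print(n_top_level)
print(poly_dim)
print(poly_class)
```

This is probably also what lies behind the third question: printing functions such as head() expand a matrix column into its individual columns, while a viewer that works from the top-level components sees only a single entry.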
If you're still working in the environment from which lm was called in the first place, and if lm was called using the data argument, you can use eval(getCall(fm)$data) to get the original data. If things are being passed in and out of functions, or if someone used lm on independent objects in the environment, you're probably out of luck. If you get in trouble you can try
eval(getCall(fm)$data, environment(formula(fm)))
but things rapidly start getting harder.
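A minimal sketch of both retrieval attempts, assuming the model was fitted at top level with a data argument, as in the question. The set.seed() call and the names orig1, orig2 and same_x2 are mine.

```r
set.seed(1)  # added only so the sketch is reproducible
n  <- 100
df <- data.frame(y = rnorm(n), x1 = rnorm(n), x2 = rnorm(n))
fm <- lm(y ~ x1 + poly(x2, 2), data = df)

# Attempt 1: re-evaluate the `data` argument of the stored call
# in the current environment.
orig1 <- eval(getCall(fm)$data)

# Attempt 2: evaluate it in the environment attached to the formula,
# which can help when the current environment no longer contains `df`.
orig2 <- eval(getCall(fm)$data, environment(formula(fm)))

# The untransformed x2 is recovered, unlike the poly() columns in fm$model.
same_x2 <- identical(orig1$x2, df$x2)
print(same_x2)
```

Note that neither attempt reads anything stored in the fitted object: both simply look up whatever is currently bound to the name df, so if df has been modified or removed since the fit, the result will no longer match the data the model was fitted to.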
I don't fully understand the logic of storing the processed model frame rather than the raw data, but I think it has to do with the construction of the terms object for the linear model: each element in the stored model frame corresponds to an element of the terms object. I don't really understand the distinction between factors (which are post-processed by model.matrix into sets of columns of dummy variables) and transformed data (e.g. log(x)) or special objects like polynomial or spline bases ...
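The correspondence between model frame and terms, and the different treatment of factors, can be seen in a small sketch. The factor g is a hypothetical extra predictor I added to the question's example purely for illustration; set.seed() and the *_names variables are likewise mine.

```r
set.seed(1)  # added only so the sketch is reproducible
n  <- 100
df <- data.frame(y = rnorm(n), x1 = rnorm(n), x2 = rnorm(n),
                 g = factor(rep(c("a", "b", "c"), length.out = n)))  # hypothetical factor
fm <- lm(y ~ x1 + poly(x2, 2) + g, data = df)

mf_names    <- names(model.frame(fm))          # response plus one entry per term
term_labels <- attr(terms(fm), "term.labels")  # one label per term
mm_names    <- colnames(model.matrix(fm))      # fully expanded design columns

print(mf_names)
print(term_labels)
print(mm_names)
```

In the model frame, g is still a single factor column and poly(x2, 2) is a single (matrix-valued) entry, so the entries after the response line up one-to-one with the term labels. Only model.matrix() expands everything into one numeric column per coefficient, including the dummy variables for g.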
To the extent that this is explained anywhere in detail, it's probably in the "White Book" (Chambers, J. M., and T. J. Hastie, eds. 1991. Statistical Models in S 1st ed. Chapman and Hall/CRC), but even there I don't think there's a completely satisfying answer.