r regression

How to efficiently perform regression using groups of variables


Is there a way of doing the following in R? I mean using only functionality native to R.

Say I have a data frame, df, with 8 columns: x1 to x8. And I want to regress x8 on x1 to x7.

Let's denote the variables (x1, x2, x3) as P. And the variables (x4, x5, x6) can be Q.

Given these definitions, is there a way for me to execute something like the following more succinctly?

model = lm(x8 ~ (x1 + x2 + x3) * (x4 + x5 + x6) + x7, data=df)

As it stands, I just cobble stuff together with paste() and plop it into as.formula().

model = lm(as.formula(paste(stmt, sep="")), data=df)

Consider also that this isn't for a particular analysis. If there's a simple way of doing it, great. If not, I'll just use duct tape like I am now.
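
For concreteness, the paste() workaround might look something like the sketch below. The question does not show how stmt is built, so the construction here, the helper name group, and the simulated df are all assumptions rather than the asker's actual code.

```r
# Simulated stand-in for df (assumption: eight numeric columns x1..x8).
set.seed(1)
df <- as.data.frame(matrix(rnorm(100 * 8), ncol = 8))
names(df) <- paste0("x", 1:8)

P <- c("x1", "x2", "x3")
Q <- c("x4", "x5", "x6")

# Hypothetical construction of stmt: join each group with "+" and
# wrap it in parentheses so that "*" applies to the whole group.
group <- function(vars) paste0("(", paste(vars, collapse = " + "), ")")
stmt <- paste("x8 ~", group(P), "*", group(Q), "+ x7")

model <- lm(as.formula(stmt), data = df)
```

Printing stmt before fitting is an easy way to confirm that the string matches the formula you meant to write.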


Solution

  • The only way I could think of is to simply use matrices:

    P <- c("x1", "x2", "x3")
    Q <- c("x4", "x5", "x6")
    other <- "x7"
    response <- "x8"
    
    df2 <- as.matrix(df)
    

    Then do:

    lm(df2[,response] ~ df2[,P] * df2[,Q] + df2[,other])
    

    Note that you could also subset the matrices outside the formula, i.e.:

    p <- df2[,P]
    q <- df2[,Q]
    o <- df2[,other]
    y <- df2[,response]
    
    lm(y ~ p * q + o)
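
    As a quick sanity check, the sketch below fits the matrix version on simulated data and compares it with the spelled-out formula from the question. The simulated df is an assumption. Note also that as.matrix() only returns a numeric matrix when every column of df is numeric.

    ```r
    # Simulated stand-in for df (assumption: eight numeric columns x1..x8).
    set.seed(1)
    df <- as.data.frame(matrix(rnorm(100 * 8), ncol = 8))
    names(df) <- paste0("x", 1:8)

    df2 <- as.matrix(df)
    p <- df2[, c("x1", "x2", "x3")]
    q <- df2[, c("x4", "x5", "x6")]
    o <- df2[, "x7"]
    y <- df2[, "x8"]

    fit_matrix  <- lm(y ~ p * q + o)
    fit_formula <- lm(x8 ~ (x1 + x2 + x3) * (x4 + x5 + x6) + x7, data = df)

    # Both fits span the same columns, so the fitted values should agree.
    all.equal(unname(fitted(fit_matrix)), unname(fitted(fit_formula)))

    # Inspect how the matrix-based fit labels its coefficients.
    names(coef(fit_matrix))
    ```

    The coefficient names in the matrix fit are built from the matrix name plus the column name, so they will not match the plain variable names of the formula fit.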
    

    Edit:

    Alternatively, use factor() to generate the pairwise interaction terms and build the formula with reformulate():

    f <- reformulate(c(P, Q, levels(factor(P):factor(Q)), other), response)
    f
    # x8 ~ x1 + x2 + x3 + x4 + x5 + x6 + x1:x4 + x1:x5 + x1:x6 + x2:x4 + 
    #     x2:x5 + x2:x6 + x3:x4 + x3:x5 + x3:x6 + x7
    
    lm(f, df)
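
    If this comes up often, the reformulate() idea can be wrapped in a small helper. The sketch below uses outer() instead of the factor trick to generate the pairwise labels. group_formula is a hypothetical name, not a base R function.

    ```r
    # Hypothetical helper: main effects for P, Q and `other`,
    # plus every pairwise P:Q interaction.
    group_formula <- function(response, P, Q, other = character(0)) {
      pq <- outer(P, Q, paste, sep = ":")  # character matrix of labels such as "x1:x4"
      reformulate(c(P, Q, pq, other), response = response)
    }

    f <- group_formula("x8",
                       P = c("x1", "x2", "x3"),
                       Q = c("x4", "x5", "x6"),
                       other = "x7")

    # The term labels should match those of the hand-written formula, up to order.
    explicit <- x8 ~ (x1 + x2 + x3) * (x4 + x5 + x6) + x7
    setequal(labels(terms(f)), labels(terms(explicit)))
    ```

    The resulting f can be passed to lm(f, df) exactly as above. Because the formula names ordinary columns, the coefficients keep their plain variable names.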