Using glm in R for linear regression on a large dataframe - issues with column subsetting


I am trying to use glm in R on a data frame with ~1000 columns. I want to select one specific independent variable and loop over each of the ~1000 columns representing the dependent variables, fitting one model per column.

As a test, the glm equation works perfectly fine when I specify a single column using df$col1 for both my dependent and independent variables.

I can't seem to correctly subset a range of columns (below) and I keep getting this error, no matter how many ways I try to format the df:

'data' must be a data.frame, environment, or list

What I tried:

# df is my data frame
cols <- df[, 20:1112]

for (i in cols) {
    glm <- glm(df$col1 ~ ., data=df, family=gaussian)
}

Solution

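  • The loop in the question never uses i, so every iteration fits the same model, and the . on the right-hand side of the formula stands for every column of df. It would be more idiomatic to loop over the column names and build each formula with reformulate():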

    predvars <- names(df)[20:1112]
    glm_list <- list()  ## presumably you want to save the results??
    for (pv in predvars) {
        glm_list[[pv]] <- glm(reformulate(pv, response = "col1"), 
           data=df, family=gaussian)
    }
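
To make it clearer what reformulate() contributes inside the loop, here is a minimal sketch. The predictor names snp1 and snp2 are hypothetical stand-ins for names(df)[20:1112]; nothing else is assumed about the real data.

```r
# Hypothetical predictor names, standing in for names(df)[20:1112]
predvars <- c("snp1", "snp2")

# reformulate() turns character strings into a formula object,
# here with col1 as the response and one predictor on the right-hand side
f <- reformulate(predvars[1], response = "col1")
print(f)

# A roughly equivalent construction using paste() and as.formula()
g <- as.formula(paste("col1 ~", predvars[1]))
print(g)
```

Either form can be passed to glm() or lm() together with data = df, which is what lets the model change on each pass through the loop.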
    

    In fact, if you really just want to do a Gaussian GLM then it will be slightly faster to use

    lm(reformulate(pv, response = "col1"), data = df)
    

    in the loop instead.
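
As a quick sanity check that swapping glm(..., family=gaussian) for lm() gives the same fit, the sketch below compares the two on simulated data. The data frame toy and the column x are made up for illustration and are not part of the original question.

```r
set.seed(1)

# Simulated stand-in for the real data: one response and one predictor
toy <- data.frame(col1 = rnorm(50), x = rnorm(50))

fit_glm <- glm(col1 ~ x, data = toy, family = gaussian)
fit_lm  <- lm(col1 ~ x, data = toy)

# Compare the estimated intercept and slope from both fits
same_coefs <- all.equal(coef(fit_glm), coef(fit_lm))
print(same_coefs)
```

The two objects still differ in class and in what summary() reports (for example, deviance versus R-squared), so pick whichever output is more convenient downstream.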

    If you want to get fancy:

    formlist <- lapply(predvars, reformulate, response = "col1")
    lm_list <- lapply(formlist, lm, data = df)
    names(lm_list) <- predvars
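
Once the models are in a named list, you will probably want one row of results per predictor. The following is a sketch on simulated data, assuming every predictor is a single numeric column (so its slope is row 2 of the coefficient table). The data frame sim and the names p1 to p3 are hypothetical.

```r
set.seed(42)
n <- 100

# Simulated stand-in: response col1 plus three hypothetical numeric predictors
sim <- data.frame(col1 = rnorm(n), p1 = rnorm(n), p2 = rnorm(n), p3 = rnorm(n))
predvars <- names(sim)[2:4]

# Same pattern as above: one formula and one fitted model per predictor
formlist <- lapply(predvars, reformulate, response = "col1")
lm_list <- lapply(formlist, lm, data = sim)
names(lm_list) <- predvars

# Pull the slope estimate and its p-value out of each fit.
# Row 1 of the coefficient table is the intercept, row 2 the predictor.
results <- do.call(rbind, lapply(predvars, function(pv) {
    ct <- coef(summary(lm_list[[pv]]))
    data.frame(predictor = pv,
               estimate = ct[2, "Estimate"],
               p_value = ct[2, "Pr(>|t|)"])
}))
print(results)
```

With ~1000 predictors you would normally follow this with a multiple-testing correction, e.g. p.adjust(results$p_value, method = "BH").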