Tags: r, performance, lapply

R Loop Over Datasets and Store Model Coefficients


I have 3 data sets and wish to run the same linear model on all of them, storing the coefficient and its upper and lower confidence limits.

    set.seed(1)
    school1 = data.frame(student = sample(c(1:100), 100, r = T),
                         score = runif(100))
    school2 = data.frame(student = sample(c(1:100), 100, r = T),
                         score = runif(100))
    school3 = data.frame(student = sample(c(1:100), 100, r = T),
                         score = runif(100))
                         
    schools = list('school1', 'school2', 'school3')
    storage <- vector('list', length(schools))
    
    for(i in seq_along(schools)){
      tmpdat <- schools[[i]]
      tmp <- lm(score ~ x1, data = tmpdat)
      storage[[i]] <- summary(tmp)$coef[1]
    }

I wish to create WANT, which stores all of this information along with the name of each data set:

    WANT = data.frame(data = c('school1', 'school2', 'school3'),
                      coef = c(0,0,0),
                      coefLL = c(0,0,0),
                      coefUL = c(0,0,0))

But I am struggling: I can loop over the data sets, but I do not know how to store all the information I need. I also have about 1000 data sets, so the most efficient approach would be best. Thank you so much.


Solution

  • There are a few odd things about your setup. You don't have a list of school data sets; you have a list of school names. By "the coefficient", do you mean you're only interested in the slope (throwing away the intercept)? And why do you have a predictor variable x1 in your model when it's not in your data?

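    library(broom)      ## tidy()
    library(tidyverse)  ## filter(), select(), bind_rows()
    schoolnames <- c('school1', 'school2', 'school3')
    ## mget() looks up the data frames by name and returns a named list
    schools <- mget(schoolnames)
    res <- vector(length = length(schools), mode = "list")
    names(res) <- schoolnames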
    for(i in seq_along(schools)){
          tmp <- lm(score ~ student, data = schools[[i]])
          res[[i]] <- (tidy(tmp, conf.int = TRUE)
               |> filter(term == "student")
               |> select(estimate, conf.low, conf.high)
          )
        }
    WANT <- bind_rows(res, .id = "school")
    

    You could also use purrr::map() for this ...
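
A minimal sketch of what the purrr::map() version might look like, assuming R 4.1 or later (for the native pipe) and that the data frames are kept in a named list. The helper make_school() is hypothetical and only simulates data shaped like the question's:

```r
library(purrr)
library(broom)
library(dplyr)

set.seed(1)
## hypothetical helper: simulate one school's data, as in the question
make_school <- function(n = 100) {
  data.frame(student = sample(1:100, n, replace = TRUE),
             score = runif(n))
}
schools <- list(school1 = make_school(),
                school2 = make_school(),
                school3 = make_school())

WANT <- schools |>
  ## fit the same model to every data set
  map(function(d) lm(score ~ student, data = d)) |>
  ## one tidy coefficient table (with confidence limits) per model
  map(function(fit) tidy(fit, conf.int = TRUE)) |>
  ## stack the tables; list names become the "school" column
  bind_rows(.id = "school") |>
  ## keep only the slope and its confidence limits
  filter(term == "student") |>
  select(school, estimate, conf.low, conf.high)
```

This should give one row per school, with the same columns as the loop version above.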

    If for some reason you wanted to do this in a lower-tech way, you could:

    res <- data.frame(schools = schoolnames, est = rep(NA,3),
                      lwr = rep(NA,3), upr = rep(NA,3))
    for(i in seq_along(schools)){
          tmp <- lm(score ~ student, data = schools[[i]])
          ## use element 2/row 2 to pick out the slope coefficient/CIs
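          ## index columns by name so the school name in column 1 is not overwritten
          res[i, "est"] <- coef(tmp)[2]
          res[i, "lwr"] <- confint(tmp)[2,1]  ## lower CI in column 1
          res[i, "upr"] <- confint(tmp)[2,2]  ## upper CI in column 2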
    }
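
Since the question mentions roughly 1000 data sets, here is a base-R sketch using vapply(), under the assumption that the data frames can be held in one named list rather than as separate variables (which avoids the mget() step). The simulated data, the helper fit_one(), and the output column names (taken from the question's WANT) are illustrative only:

```r
set.seed(1)
## simulated stand-in for the real data: a named list of 1000 data frames
n_schools <- 1000
schools <- lapply(seq_len(n_schools), function(i) {
  data.frame(student = sample(1:100, 100, replace = TRUE),
             score = runif(100))
})
names(schools) <- paste0("school", seq_len(n_schools))

## hypothetical helper: fit one model, return the slope and its 95% limits
fit_one <- function(d) {
  fit <- lm(score ~ student, data = d)
  ci <- confint(fit, parm = "student")  ## 1 x 2 matrix: lower, upper
  c(coef = unname(coef(fit)["student"]),
    coefLL = ci[1, 1],
    coefUL = ci[1, 2])
}

## vapply() returns a 3 x n matrix; transpose to get one row per school
est <- t(vapply(schools, fit_one, c(coef = 0, coefLL = 0, coefUL = 0)))
WANT <- data.frame(data = rownames(est), est, row.names = NULL)
```

Because the result is built once from a matrix, nothing has to grow row by row inside a loop. Most of the run time is likely to be the lm() calls themselves, so it is worth timing this on the real data before optimizing further.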