Search code examples
rlmtidyversepurrrbroom

Passing list of data frames into lm() and viewing the results


I've got three data frames, dfLON, dfMOS, and dfATA. Each features the same variables: y is a continuous variable, and a, b, and c are binary categorical, and there is also some NA.

I'd like to build separate linear regression models, one for each data set.

With my current code I've managed to make a list of data frames and pass it into lm(). But is there are more concise way to view the results than eg fitdfLON <- DfList[[1]]? I've provided three data frames in this example but I actually have ~25 so I'd have to type it 25 times!

Any help would be much appreciated.

Starting point (dfs):

dfLON <- data.frame(y=c(1.23,2.32,3.21,2.43),a=c(1,NA,1,2),b=c(1,1,2,2),c=c(2,1,2,1))
dfMOS <- data.frame(y=c(4.56,6.54,4.43,5.78),a=c(2,1,2,1),b=c(2,1,1,2),c=c(1,2,1,2))
dfATA <- data.frame(y=c(1.22,6.54,3.23,4.23),a=c(2,2,2,1),b=c(1,2,1,2),c=c(1,NA,1,2))

Current code:

Mylm <- function(df){
 fit <- lm(y ~ a + b + c, data=df)
  return(fit)
}
DfList <- lapply(list(dfLON, dfMOS, dfATA), Mylm)

fitdfLON <- DfList[[1]]
fitdfMOS <- DfList[[2]]
fitdfATA <- DfList[[3]]

Solution

  • Whenever you're running models on many different datasets, it makes sense to tidy them using the broom library. This produces a clean data frame for each model, which you can then output or use in downstream analyses.

    Simplest example:

    library(broom)
    
    Mylm <- function(df){
      fit <- lm(y ~ a + b + c, data=df)
      tidy(fit) # tidy the fit object
    }
    
    list(dfLON, dfMOS, dfATA) %>% lapply(Mylm)
    
    #[[1]]
    #         term estimate std.error statistic p.value
    #1 (Intercept)     0.03       NaN       NaN     NaN
    #2           a    -0.78       NaN       NaN     NaN
    #3           b     1.98       NaN       NaN     NaN
    #
    #[[2]]
    #         term estimate std.error  statistic    p.value
    #1 (Intercept)   8.2975  0.969855  8.5554025 0.07407531
    #2           a  -1.6650  0.445000 -3.7415730 0.16626155
    #3           b  -0.3150  0.445000 -0.7078652 0.60785169
    #
    #[[3]]
    #         term estimate std.error statistic   p.value
    #1 (Intercept)    6.235  3.015000  2.067993 0.2867398
    #2           a   -2.005  1.740711 -1.151828 0.4551559
    

    And you can now combine this with the map_dfr() function from purrr to combine everything into one combined data frame:

    library(purrr)
    
    # note the named list entries; these will go into the "model" column
    # without them, you'd just get a model number
    list("LON" = dfLON, "MOS" = dfMOS, "ATA" = dfATA) %>% 
      map_dfr(Mylm, .id = "model")
    
    #  model        term estimate std.error  statistic    p.value
    #1   LON (Intercept)   0.0300       NaN        NaN        NaN
    #2   LON           a  -0.7800       NaN        NaN        NaN
    #3   LON           b   1.9800       NaN        NaN        NaN
    #4   MOS (Intercept)   8.2975  0.969855  8.5554025 0.07407531
    #5   MOS           a  -1.6650  0.445000 -3.7415730 0.16626155
    #6   MOS           b  -0.3150  0.445000 -0.7078652 0.60785169
    #7   ATA (Intercept)   6.2350  3.015000  2.0679934 0.28673976
    #8   ATA           a  -2.0050  1.740711 -1.1518281 0.45515586
    

    And to make things more compact, you can define the function on the fly inside map_dfr. Seems appropriate when all you're doing is fit a linear model.

    list("LON" = dfLON, "MOS" = dfMOS, "ATA" = dfATA) %>% 
      map_dfr(~ tidy(lm(y ~ a + b + c, data = .)),
              .id = "model")
    
    #  model        term estimate std.error  statistic    p.value
    #1   LON (Intercept)   0.0300       NaN        NaN        NaN
    #2   LON           a  -0.7800       NaN        NaN        NaN
    #3   LON           b   1.9800       NaN        NaN        NaN
    #4   MOS (Intercept)   8.2975  0.969855  8.5554025 0.07407531
    #5   MOS           a  -1.6650  0.445000 -3.7415730 0.16626155
    #6   MOS           b  -0.3150  0.445000 -0.7078652 0.60785169
    #7   ATA (Intercept)   6.2350  3.015000  2.0679934 0.28673976
    #8   ATA           a  -2.0050  1.740711 -1.1518281 0.45515586