I've got three data frames, dfLON, dfMOS, and dfATA. Each features the same variables: y is a continuous variable, and a, b, and c are binary categorical, and there is also some NA
.
I'd like to build separate linear regression models, one for each data set.
With my current code I've managed to make a list of data frames and pass it into lm(). But is there are more concise way to view the results than eg fitdfLON <- DfList[[1]]
? I've provided three data frames in this example but I actually have ~25 so I'd have to type it 25 times!
Any help would be much appreciated.
Starting point (dfs):
dfLON <- data.frame(y=c(1.23,2.32,3.21,2.43),a=c(1,NA,1,2),b=c(1,1,2,2),c=c(2,1,2,1))
dfMOS <- data.frame(y=c(4.56,6.54,4.43,5.78),a=c(2,1,2,1),b=c(2,1,1,2),c=c(1,2,1,2))
dfATA <- data.frame(y=c(1.22,6.54,3.23,4.23),a=c(2,2,2,1),b=c(1,2,1,2),c=c(1,NA,1,2))
Current code:
Mylm <- function(df){
fit <- lm(y ~ a + b + c, data=df)
return(fit)
}
DfList <- lapply(list(dfLON, dfMOS, dfATA), Mylm)
fitdfLON <- DfList[[1]]
fitdfMOS <- DfList[[2]]
fitdfATA <- DfList[[3]]
Whenever you're running models on many different datasets, it makes sense to tidy them using the broom library. This produces a clean data frame for each model, which you can then output or use in downstream analyses.
Simplest example:
library(broom)
Mylm <- function(df){
fit <- lm(y ~ a + b + c, data=df)
tidy(fit) # tidy the fit object
}
list(dfLON, dfMOS, dfATA) %>% lapply(Mylm)
#[[1]]
# term estimate std.error statistic p.value
#1 (Intercept) 0.03 NaN NaN NaN
#2 a -0.78 NaN NaN NaN
#3 b 1.98 NaN NaN NaN
#
#[[2]]
# term estimate std.error statistic p.value
#1 (Intercept) 8.2975 0.969855 8.5554025 0.07407531
#2 a -1.6650 0.445000 -3.7415730 0.16626155
#3 b -0.3150 0.445000 -0.7078652 0.60785169
#
#[[3]]
# term estimate std.error statistic p.value
#1 (Intercept) 6.235 3.015000 2.067993 0.2867398
#2 a -2.005 1.740711 -1.151828 0.4551559
And you can now combine this with the map_dfr()
function from purrr to combine everything into one combined data frame:
library(purrr)
# note the named list entries; these will go into the "model" column
# without them, you'd just get a model number
list("LON" = dfLON, "MOS" = dfMOS, "ATA" = dfATA) %>%
map_dfr(Mylm, .id = "model")
# model term estimate std.error statistic p.value
#1 LON (Intercept) 0.0300 NaN NaN NaN
#2 LON a -0.7800 NaN NaN NaN
#3 LON b 1.9800 NaN NaN NaN
#4 MOS (Intercept) 8.2975 0.969855 8.5554025 0.07407531
#5 MOS a -1.6650 0.445000 -3.7415730 0.16626155
#6 MOS b -0.3150 0.445000 -0.7078652 0.60785169
#7 ATA (Intercept) 6.2350 3.015000 2.0679934 0.28673976
#8 ATA a -2.0050 1.740711 -1.1518281 0.45515586
And to make things more compact, you can define the function on the fly inside map_dfr
. Seems appropriate when all you're doing is fit a linear model.
list("LON" = dfLON, "MOS" = dfMOS, "ATA" = dfATA) %>%
map_dfr(~ tidy(lm(y ~ a + b + c, data = .)),
.id = "model")
# model term estimate std.error statistic p.value
#1 LON (Intercept) 0.0300 NaN NaN NaN
#2 LON a -0.7800 NaN NaN NaN
#3 LON b 1.9800 NaN NaN NaN
#4 MOS (Intercept) 8.2975 0.969855 8.5554025 0.07407531
#5 MOS a -1.6650 0.445000 -3.7415730 0.16626155
#6 MOS b -0.3150 0.445000 -0.7078652 0.60785169
#7 ATA (Intercept) 6.2350 3.015000 2.0679934 0.28673976
#8 ATA a -2.0050 1.740711 -1.1518281 0.45515586