I have the following data frame:
input.df <- tibble::tibble(x = rnorm(4),
                           y = rnorm(4),
                           `z 1` = rnorm(4))
I would like to regress each column on all the other columns and extract the R-squared from each model. In other words, I could run the following code:
summary(lm(x ~ ., data = input.df))
summary(lm(y ~ ., data = input.df))
summary(lm(`z 1` ~ ., data = input.df))
And note down the R-squared.
I'd like to automate this task and get a two-column data frame where the first column is the dependent variable and the second column is the R-squared.
This is what I've tried:
n <- ncol(input.df)
replicate(n, input.df, simplify = FALSE) %>%
  dplyr::bind_rows() %>%
  dplyr::mutate(group = rep(names(.), each = nrow(.) / n)) %>%
  dplyr::group_by(group) %>%
  dplyr::do({
    tgt.var <- .$group[1]
    # How do I get the formula to interpret . as all variables?
    lm(get(tgt.var) ~ ., data = .) %>%
      broom::glance() %>%
      dplyr::select(r.squared)
  })
I've put a comment at the part where I am stuck. I get the following error:
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : contrasts can be applied only to factors with 2 or more levels
I think you've overcomplicated building your data frame a little. There is no need for replicate, as you are running all regressions on the same dataset. You could just use map from purrr. The idea is to try something like
library(purrr)
names(input.df) %>%
  map(~ lm(get(.) ~ ., data = input.df))
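This runs without errors but doesn't give the desired result. The reason is that the . on the right-hand side expands to every column of input.df that does not appear by name on the left-hand side, and get(.) is not recognised as a column name, so the target variable stays among the predictors. For example, the first regression is effectively x ~ x + y + `z 1`, which is not what we want. This is easily fixed by building the formula passed to lm from the column name, as follows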
names(input.df) %>%
  map(~ lm(formula(paste0('`', ., '` ~ .')), data = input.df))
Note the backticks around the variable name: they are needed because the name of your third variable contains a space; otherwise they wouldn't have been necessary. This now gives the desired results. If you don't want to keep the full models and only want to extract the R-squared values, you can just do
names(input.df) %>%
  map(~ lm(formula(paste0('`', ., '` ~ .')), data = input.df)) %>%
  map(summary) %>%
  map_dbl('r.squared')
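Since you asked for a two-column data frame rather than a bare vector, here is a minimal sketch of one way to get there. It assumes the tibble package is available, names the vector with purrr's set_names() so the variable names survive the pipeline, and then uses enframe() to turn the named vector into a data frame. The set.seed() call and the column names dependent and r.squared are my own choices for illustration, not something from your question.

```r
library(purrr)
library(tibble)

set.seed(1)  # assumption: only here to make the example reproducible
input.df <- tibble(x = rnorm(4),
                   y = rnorm(4),
                   `z 1` = rnorm(4))

r2 <- names(input.df) %>%
  set_names() %>%  # name each element after its own value
  map(~ lm(formula(paste0("`", .x, "` ~ .")), data = input.df)) %>%
  map(summary) %>%
  map_dbl("r.squared")  # named numeric vector, one entry per column

# One row per dependent variable; the column names are illustrative
result <- enframe(r2, name = "dependent", value = "r.squared")
result
```

Because map_dbl() keeps the names set earlier in the pipeline, each R-squared stays paired with its dependent variable without any manual bookkeeping.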