How do you repeat on filtering datasets and then running regressions without writing out individual code?


I want to run a linear regression on the full mtcars data, with mtcars$am as the IV and mtcars$mpg as the DV. I then want to use the grouping variable mtcars$gear to create 3 datasets, where mtcars$gear is 3, 4, or 5, and run the regression again on each of these 3 datasets separately.

The long process that I currently use is below.

Unique values of variables of interest:

## variables of interest
unique(mtcars$mpg)
# ---- NOTE: DV is mpg
unique(mtcars$am)
# ---- NOTE: IV is am
unique(mtcars$gear)
# ---- NOTE: grouping variable is gear

Here is the baseline code I used for the regression:

## linear regression with all data
lm__am_on_mpg__mtcars <- lm(mpg ~ am, data=mtcars)
summary(lm__am_on_mpg__mtcars)

I then used the filter() function from dplyr (loaded as part of the tidyverse package) to create 3 datasets, where mtcars$gear is 3, 4, or 5:

### list of filtered datasets
str(mtcars__gear_is_3)
str(mtcars__gear_is_4)
str(mtcars__gear_is_5)

I then ran 3 regressions with the same structure as the baseline regression above, each on the dataset for a different mtcars$gear level.

#### when mtcars__gear_is_3 is dataset used
lm__am_on_mpg__mtcars__gear_is_3 <- lm(mpg ~ am, data=mtcars__gear_is_3)
summary(lm__am_on_mpg__mtcars__gear_is_3)

#### when mtcars__gear_is_4 is dataset used
lm__am_on_mpg__mtcars__gear_is_4 <- lm(mpg ~ am, data=mtcars__gear_is_4)
summary(lm__am_on_mpg__mtcars__gear_is_4)

#### when mtcars__gear_is_5 is dataset used
lm__am_on_mpg__mtcars__gear_is_5 <- lm(mpg ~ am, data=mtcars__gear_is_5)
summary(lm__am_on_mpg__mtcars__gear_is_5)

This seems to work, but it is a lot of code, and I feel it could be written more concisely. I want to know if I can shorten this process by writing code that: (A) creates the different datasets in a shorter way using the tidyverse filter method, and (B) runs the different regressions in a shorter way that just swaps in the appropriate dataset without having to write everything out the long way.

Here are my questions: (1) Is this possible to do in R in general? (2) Is this possible with datasets? (2.1) If so, how? (3) Is this possible with regressions? (3.1) If so, how?

====================

Here is the R code that I used to complete this task the long way:

# How do you repeat on filtering datasets and then running regressions in R without writing out individual code?

## dataset of interest
mtcars

### info about dataset
head(mtcars)
str(mtcars)
colnames(mtcars)

## variables of interest
unique(mtcars$mpg)
# ---- NOTE: DV is mpg
unique(mtcars$am)
# ---- NOTE: IV is am
unique(mtcars$gear)
# ---- NOTE: grouping variable is gear

## linear regression with all data
lm__am_on_mpg__mtcars <- lm(mpg ~ am, data=mtcars)
summary(lm__am_on_mpg__mtcars)

## filter data based on mtcars$gear

### loads tidyverse
library(tidyverse)

### when mtcars$gear == 3

#### creates filtered dataset
# ---- NOTE: starting dataset - mtcars
# ---- NOTE: ending dataset - mtcars__gear_is_3
# ---- NOTE: filter variable - gear
# ---- NOTE: filter variable value(s) - 3

##### starting dataset
str(mtcars)

##### unique values of starting dataset$filter
unique(mtcars$gear)

##### filters data into post-filter dataset
mtcars__gear_is_3 <- filter(mtcars, gear == 3)

##### turns post-filter dataset into data frame
mtcars__gear_is_3 <- data.frame(mtcars__gear_is_3)

##### post-filter dataset
str(mtcars__gear_is_3)

##### unique values of post-filter dataset$filter
unique(mtcars__gear_is_3$gear)

### when mtcars$gear == 4

#### creates filtered dataset
# ---- NOTE: starting dataset - mtcars
# ---- NOTE: ending dataset - mtcars__gear_is_4
# ---- NOTE: filter variable - gear
# ---- NOTE: filter variable value(s) - 4

##### starting dataset
str(mtcars)

##### unique values of starting dataset$filter
unique(mtcars$gear)

##### filters data into post-filter dataset
mtcars__gear_is_4 <- filter(mtcars, gear == 4)

##### turns post-filter dataset into data frame
mtcars__gear_is_4 <- data.frame(mtcars__gear_is_4)

##### post-filter dataset
str(mtcars__gear_is_4)

##### unique values of post-filter dataset$filter
unique(mtcars__gear_is_4$gear)

### when mtcars$gear == 5

#### creates filtered dataset
# ---- NOTE: starting dataset - mtcars
# ---- NOTE: ending dataset - mtcars__gear_is_5
# ---- NOTE: filter variable - gear
# ---- NOTE: filter variable value(s) - 5

##### starting dataset
str(mtcars)

##### unique values of starting dataset$filter
unique(mtcars$gear)

##### filters data into post-filter dataset
mtcars__gear_is_5 <- filter(mtcars, gear == 5)

##### turns post-filter dataset into data frame
mtcars__gear_is_5 <- data.frame(mtcars__gear_is_5)

##### post-filter dataset
str(mtcars__gear_is_5)

##### unique values of post-filter dataset$filter
unique(mtcars__gear_is_5$gear)

## regressions where data is filtered by gear

### list of filtered datasets
str(mtcars__gear_is_3)
str(mtcars__gear_is_4)
str(mtcars__gear_is_5)

#### when mtcars__gear_is_3 is dataset used
lm__am_on_mpg__mtcars__gear_is_3 <- lm(mpg ~ am, data=mtcars__gear_is_3)
summary(lm__am_on_mpg__mtcars__gear_is_3)

#### when mtcars__gear_is_4 is dataset used
lm__am_on_mpg__mtcars__gear_is_4 <- lm(mpg ~ am, data=mtcars__gear_is_4)
summary(lm__am_on_mpg__mtcars__gear_is_4)

#### when mtcars__gear_is_5 is dataset used
lm__am_on_mpg__mtcars__gear_is_5 <- lm(mpg ~ am, data=mtcars__gear_is_5)
summary(lm__am_on_mpg__mtcars__gear_is_5)


Solution

  • Maybe you can achieve your goal with something like this:

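    library(data.table)
    dt <- as.data.table(mtcars)
    # build one lm() call, as text, per unique value of gear
    formulas <- paste0("lm(mpg ~ am, data = dt[gear == ", unique(dt[,gear]), "])" )
    # parse and evaluate each call; l is a list of fitted models
    l <- lapply(formulas, function(x) eval(parse(text=x)))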
    

    and to see all models, just use:

    l
    

    or to see the summary of one of the models:

    summary(l[[1]])