Search code examples
rlinear-regression

How to compute multiple linear models over combinations of columns in data frame?


Ok, this question may be a little complex, so let me do my best explaining my scenario.

I have a data frame, similar to the example set below, where I have a predictor column (xpred) and three response columns (y1:y3). All have an associated log transformed column as I wish to run a linear model with lm() using the log transformed data.

suppressWarnings(library(tidyverse))

# Create example data. 
set.seed(10)
df <- data.frame(
  subject = rep(paste("Subject", LETTERS[1:5]), each = 10),
  xpred = rep(1:10, 5),
  y1 = sort(runif(10, min = 130, max = 220), decreasing = TRUE),
  y2 = sort(runif(10, min = 10, max = 90), decreasing = TRUE),
  y3 = sort(runif(10, min = 2, max = 5), decreasing = TRUE)
  )

# Log transfrom pred_x:y3 columns.
df <- df %>%
  group_by(subject) %>%
  mutate(across(.cols = xpred:y3, .fns = ~log(.x), .names = "{col}_log")) %>%
  ungroup()

head(df)
#> # A tibble: 6 × 9
#>   subject   xpred    y1    y2    y3 xpred_log y1_log y2_log y3_log
#>   <chr>     <int> <dbl> <dbl> <dbl>     <dbl>  <dbl>  <dbl>  <dbl>
#> 1 Subject A     1  192.  76.9  4.59     0       5.26   4.34   1.52
#> 2 Subject A     2  185.  62.1  4.51     0.693   5.22   4.13   1.51
#> 3 Subject A     3  176.  57.7  4.33     1.10    5.17   4.05   1.46
#> 4 Subject A     4  169.  55.4  4.31     1.39    5.13   4.01   1.46
#> 5 Subject A     5  168.  44.3  4.12     1.61    5.13   3.79   1.42
#> 6 Subject A     6  158.  41.9  3.85     1.79    5.06   3.74   1.35

Created on 2024-10-25 with reprex v2.1.1

Now, I'm comfortable running the model (example below) in isolation, but this becomes a bit repetitive when I'd like to run the model on the following combinations of columns for each unique subject in the subject column: y1_log ~ xpred_log, y2_log ~ xpred_log and y3_log ~ xpred_log.

If I were to do this in isolation, this is what my code would look like for one subject.

# Subset data for one subject.
df_subset <- df %>% filter(subject == "Subject A")

# Run model.
model <- lm(df_subset$y1_log ~ df_subset$xpred_log)

# Extract coefficients for prediction calculation.
c <- unname(exp(model$coefficients["(Intercept)"]))
n <- unname(model$coefficients["df_subset$xpred_log"])

As you'll note in my data, xpred ranges from 1 to 10 as y1:y3 have been previously calculated for these values. What I'd like to do for this particular project is to complete xpred so it ranges from 1 to 30 and predict y1:y3 for values 11 to 30 based on the formula c*x^n were x is xpred. For example, if I wanted to predict y1 off a value of 11, it would be c*11^n.

Hopefully, you're still with me!

So, in essence, I'd like a clean/efficient way to:

  1. Perform lm() over the aforementioned combinations of log transformed columns in my data frame for each subject. I'm assuming this will rely on some form of group_by() call in dplyr.

  2. Use the coefficients (saved as variables c and n) to predict y1:y3 based on xpred values 11 to 30 (already know 1 to 10).

Hoping there is a solution where the final output containing xpred values 1 to 30 for each subject is stored in a tidy data frame for later analysis.

Appreciate any help with this, and please let me know if anything needs to be clarified as it is my first post.


Solution

  • Here's one approach based on nested tibbles and row-wise operations, a modification of modelling example from dplyr's Row-wise operations vignette.

    library(tidyverse)
    
    # pivot longer and then log transform;
    # create nested tibble with data for every subject-response combination (grouped rowwise);
    # fit models and extract coefficients from all combinations
    df_coef <- 
      df |> 
      pivot_longer(y1:y3, names_to = "response", values_to = "response_value") |> 
      mutate(across(.cols = c(xpred, response_value), .fns = ~log(.x), .names = "{col}_log")) |> 
      nest_by(subject, response) |> 
      mutate(
        lm(response_value_log ~ xpred_log, data = data) |> 
          coef() |> 
          bind_rows() |> 
          rename(`c` = "(Intercept)", n = "xpred_log"),
        c = exp(c)
        )
    # long nested tibble:
    df_coef
    #> # A tibble: 15 × 5
    #> # Rowwise:  subject, response
    #>    subject   response               data      c      n
    #>    <chr>     <chr>    <list<tibble[,4]>>  <dbl>  <dbl>
    #>  1 Subject A y1                 [10 × 4] 200.   -0.132
    #>  2 Subject A y2                 [10 × 4] 101.   -0.621
    #>  3 Subject A y3                 [10 × 4]   5.23 -0.226
    #>  4 Subject B y1                 [10 × 4] 200.   -0.132
    #>  5 Subject B y2                 [10 × 4] 101.   -0.621
    #>  6 Subject B y3                 [10 × 4]   5.23 -0.226
    #>  7 Subject C y1                 [10 × 4] 200.   -0.132
    #>  8 Subject C y2                 [10 × 4] 101.   -0.621
    #>  9 Subject C y3                 [10 × 4]   5.23 -0.226
    #> 10 Subject D y1                 [10 × 4] 200.   -0.132
    #> 11 Subject D y2                 [10 × 4] 101.   -0.621
    #> 12 Subject D y3                 [10 × 4]   5.23 -0.226
    #> 13 Subject E y1                 [10 × 4] 200.   -0.132
    #> 14 Subject E y2                 [10 × 4] 101.   -0.621
    #> 15 Subject E y3                 [10 × 4]   5.23 -0.226
    
    # first data tibble (SubjectA, response y1)
    df_coef$data[[1]]
    #> # A tibble: 10 × 4
    #>    xpred response_value xpred_log response_value_log
    #>    <int>          <dbl>     <dbl>              <dbl>
    #>  1     1           192.     0                   5.26
    #>  2     2           185.     0.693               5.22
    #>  3     3           176.     1.10                5.17
    #>  4     4           169.     1.39                5.13
    #>  5     5           168.     1.61                5.13
    #>  6     6           158.     1.79                5.06
    #>  7     7           155.     1.95                5.04
    #>  8     8           155.     2.08                5.04
    #>  9     9           150.     2.20                5.01
    #> 10    10           138.     2.30                4.92
    
    # replace nested data tibbles with calculated values for xpred = 11:30;
    # revrese nesting & pivoting
    df_coef |> 
      mutate(data = tibble(xpred = 11:30, response_value = c*xpred^n) |> list(), .keep = "unused") |> 
      unnest(data) |> 
      pivot_wider(names_from = response, values_from = response_value)
    

    Results:

    #> # A tibble: 100 × 5
    #> # Groups:   subject [5]
    #>    subject   xpred    y1    y2    y3
    #>    <chr>     <int> <dbl> <dbl> <dbl>
    #>  1 Subject A    11  146.  22.8  3.04
    #>  2 Subject A    12  144.  21.6  2.98
    #>  3 Subject A    13  142.  20.5  2.93
    #>  4 Subject A    14  141.  19.6  2.88
    #>  5 Subject A    15  140.  18.8  2.84
    #>  6 Subject A    16  139.  18.0  2.80
    #>  7 Subject A    17  137.  17.4  2.76
    #>  8 Subject A    18  136.  16.8  2.72
    #>  9 Subject A    19  135.  16.2  2.69
    #> 10 Subject A    20  135.  15.7  2.66
    #> # ℹ 90 more rows
    

    Example data:

    # Create example data. 
    set.seed(10)
    df <- data.frame(
      subject = rep(paste("Subject", LETTERS[1:5]), each = 10),
      xpred = rep(1:10, 5),
      y1 = sort(runif(10, min = 130, max = 220), decreasing = TRUE),
      y2 = sort(runif(10, min = 10, max = 90), decreasing = TRUE),
      y3 = sort(runif(10, min = 2, max = 5), decreasing = TRUE)
    )