Ok, this question may be a little complex, so let me do my best explaining my scenario.
I have a data frame, similar to the example set below, where I have a predictor column (xpred
) and three response columns (y1:y3
). All have an associated log transformed column as I wish to run a linear model with lm()
using the log transformed data.
suppressWarnings(library(tidyverse))
# Create example data.
set.seed(10)
df <- data.frame(
subject = rep(paste("Subject", LETTERS[1:5]), each = 10),
xpred = rep(1:10, 5),
y1 = sort(runif(10, min = 130, max = 220), decreasing = TRUE),
y2 = sort(runif(10, min = 10, max = 90), decreasing = TRUE),
y3 = sort(runif(10, min = 2, max = 5), decreasing = TRUE)
)
# Log transfrom pred_x:y3 columns.
df <- df %>%
group_by(subject) %>%
mutate(across(.cols = xpred:y3, .fns = ~log(.x), .names = "{col}_log")) %>%
ungroup()
head(df)
#> # A tibble: 6 × 9
#> subject xpred y1 y2 y3 xpred_log y1_log y2_log y3_log
#> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Subject A 1 192. 76.9 4.59 0 5.26 4.34 1.52
#> 2 Subject A 2 185. 62.1 4.51 0.693 5.22 4.13 1.51
#> 3 Subject A 3 176. 57.7 4.33 1.10 5.17 4.05 1.46
#> 4 Subject A 4 169. 55.4 4.31 1.39 5.13 4.01 1.46
#> 5 Subject A 5 168. 44.3 4.12 1.61 5.13 3.79 1.42
#> 6 Subject A 6 158. 41.9 3.85 1.79 5.06 3.74 1.35
Created on 2024-10-25 with reprex v2.1.1
Now, I'm comfortable running the model (example below) in isolation, but this becomes a bit repetitive when I'd like to run the model on the following combinations of columns for each unique subject in the subject
column: y1_log ~ xpred_log
, y2_log ~ xpred_log
and y3_log ~ xpred_log
.
If I were to do this in isolation, this is what my code would look like for one subject.
# Subset data for one subject.
df_subset <- df %>% filter(subject == "Subject A")
# Run model.
model <- lm(df_subset$y1_log ~ df_subset$xpred_log)
# Extract coefficients for prediction calculation.
c <- unname(exp(model$coefficients["(Intercept)"]))
n <- unname(model$coefficients["df_subset$xpred_log"])
As you'll note in my data, xpred
ranges from 1 to 10 as y1:y3
have been previously calculated for these values. What I'd like to do for this particular project is to complete xpred
so it ranges from 1 to 30 and predict y1:y3
for values 11 to 30 based on the formula c*x^n
were x
is xpred
. For example, if I wanted to predict y1
off a value of 11, it would be c*11^n
.
Hopefully, you're still with me!
So, in essence, I'd like a clean/efficient way to:
Perform lm()
over the aforementioned combinations of log transformed columns in my data frame for each subject. I'm assuming this will rely on some form of group_by()
call in dplyr
.
Use the coefficients (saved as variables c
and n
) to predict y1:y3
based on xpred
values 11 to 30 (already know 1 to 10).
Hoping there is a solution where the final output containing xpred
values 1 to 30 for each subject is stored in a tidy data frame for later analysis.
Appreciate any help with this, and please let me know if anything needs to be clarified as it is my first post.
Here's one approach based on nested tibbles and row-wise operations, a modification of modelling example from dplyr's Row-wise operations vignette.
library(tidyverse)
# pivot longer and then log transform;
# create nested tibble with data for every subject-response combination (grouped rowwise);
# fit models and extract coefficients from all combinations
df_coef <-
df |>
pivot_longer(y1:y3, names_to = "response", values_to = "response_value") |>
mutate(across(.cols = c(xpred, response_value), .fns = ~log(.x), .names = "{col}_log")) |>
nest_by(subject, response) |>
mutate(
lm(response_value_log ~ xpred_log, data = data) |>
coef() |>
bind_rows() |>
rename(`c` = "(Intercept)", n = "xpred_log"),
c = exp(c)
)
# long nested tibble:
df_coef
#> # A tibble: 15 × 5
#> # Rowwise: subject, response
#> subject response data c n
#> <chr> <chr> <list<tibble[,4]>> <dbl> <dbl>
#> 1 Subject A y1 [10 × 4] 200. -0.132
#> 2 Subject A y2 [10 × 4] 101. -0.621
#> 3 Subject A y3 [10 × 4] 5.23 -0.226
#> 4 Subject B y1 [10 × 4] 200. -0.132
#> 5 Subject B y2 [10 × 4] 101. -0.621
#> 6 Subject B y3 [10 × 4] 5.23 -0.226
#> 7 Subject C y1 [10 × 4] 200. -0.132
#> 8 Subject C y2 [10 × 4] 101. -0.621
#> 9 Subject C y3 [10 × 4] 5.23 -0.226
#> 10 Subject D y1 [10 × 4] 200. -0.132
#> 11 Subject D y2 [10 × 4] 101. -0.621
#> 12 Subject D y3 [10 × 4] 5.23 -0.226
#> 13 Subject E y1 [10 × 4] 200. -0.132
#> 14 Subject E y2 [10 × 4] 101. -0.621
#> 15 Subject E y3 [10 × 4] 5.23 -0.226
# first data tibble (SubjectA, response y1)
df_coef$data[[1]]
#> # A tibble: 10 × 4
#> xpred response_value xpred_log response_value_log
#> <int> <dbl> <dbl> <dbl>
#> 1 1 192. 0 5.26
#> 2 2 185. 0.693 5.22
#> 3 3 176. 1.10 5.17
#> 4 4 169. 1.39 5.13
#> 5 5 168. 1.61 5.13
#> 6 6 158. 1.79 5.06
#> 7 7 155. 1.95 5.04
#> 8 8 155. 2.08 5.04
#> 9 9 150. 2.20 5.01
#> 10 10 138. 2.30 4.92
# replace nested data tibbles with calculated values for xpred = 11:30;
# revrese nesting & pivoting
df_coef |>
mutate(data = tibble(xpred = 11:30, response_value = c*xpred^n) |> list(), .keep = "unused") |>
unnest(data) |>
pivot_wider(names_from = response, values_from = response_value)
Results:
#> # A tibble: 100 × 5
#> # Groups: subject [5]
#> subject xpred y1 y2 y3
#> <chr> <int> <dbl> <dbl> <dbl>
#> 1 Subject A 11 146. 22.8 3.04
#> 2 Subject A 12 144. 21.6 2.98
#> 3 Subject A 13 142. 20.5 2.93
#> 4 Subject A 14 141. 19.6 2.88
#> 5 Subject A 15 140. 18.8 2.84
#> 6 Subject A 16 139. 18.0 2.80
#> 7 Subject A 17 137. 17.4 2.76
#> 8 Subject A 18 136. 16.8 2.72
#> 9 Subject A 19 135. 16.2 2.69
#> 10 Subject A 20 135. 15.7 2.66
#> # ℹ 90 more rows
Example data:
# Create example data.
set.seed(10)
df <- data.frame(
subject = rep(paste("Subject", LETTERS[1:5]), each = 10),
xpred = rep(1:10, 5),
y1 = sort(runif(10, min = 130, max = 220), decreasing = TRUE),
y2 = sort(runif(10, min = 10, max = 90), decreasing = TRUE),
y3 = sort(runif(10, min = 2, max = 5), decreasing = TRUE)
)