r linear-regression data-modeling prediction

Predicting when an output might happen in time in R

I have a dataset of peoples blood results in R:

        ID Result date    
        A1 80     01/01/2006
        A1 70     01/01/2009
        A1 61     01/01/2010
        A1 30     01/01/2018
        A1 28     01/01/2022
        B2 40     01/01/2006
        B2 30     01/01/2012
        B2 25     01/01/2015
        B2 10     01/01/2020
        G3 28     01/01/2009
        G3 27     01/01/2014
        G3 25     01/01/2013
        G3 24     01/01/2011
        G3 22     01/01/2019

I would like to plot these and then do a linear regression and find the predicted date at which their blood result would hit 15 for each person. In reality there are multiple blood tests per year so I would try and get the median result per year and then plot those and get the predicted year they hit 15.

I can do the median per year and use the lm function along the lines of:

filtered$date_of_bloods <-format(filtered$date_of_bloods,format="%Y")
#split into individual ID groups
a <- with(filtered, split(filtered, list(ID)))

#aggregate median results per year 
medianfunc <- function(y) {aggregate(results ~ date_of_bloods, data = y, median)}
medians <- sapply(a, medianfunc)

# do lm per ID cohort and get slope of lines 
g<- as.data.frame(medians)
coefLM <- function(x) {coef(lm(date_of_bloods ~ results, data = x))}
coefs<- sapply(g, coefLM)

But I am unsure who to extract and get the prediction for when the slope of the line would hit 15.

Any help would be much appreciated

Solution

Here's a base R answer:

filtered <- read.table(text = "ID results date_of_bloods    
                A1 80     01/01/2006
                A1 70     01/01/2009
                A1 61     01/01/2010
                A1 30     01/01/2018
                A1 28     01/01/2022
                B2 40     01/01/2006
                B2 30     01/01/2012
                B2 25     01/01/2015
                B2 10     01/01/2020
                G3 28     01/01/2009
                G3 27     01/01/2014
                G3 25     01/01/2013
                G3 24     01/01/2011
                G3 22     01/01/2019", header = TRUE)
filtered$date_of_bloods <- as.Date(filtered$date_of_bloods, '%m/%d/%Y')

filtered$date_of_bloods <- format(filtered$date_of_bloods,format="%Y") |> 
  as.integer()
#split into individual ID groups
a <- with(filtered, split(filtered, list(ID)))

#aggregate median results per year 
medianfunc <- function(y) {aggregate(results ~ date_of_bloods, data = y, median)}
medians <- sapply(a, medianfunc)

# do lm per ID cohort and get slope of lines 
g <- as.data.frame(medians)
coefLM <- function(x) {coef(lm(results ~ date_of_bloods, data = x))}
coefs<- sapply(g, coefLM)  

coefs <- 
  rbind(coefs, 
        year_for_15 = (15 - coefs["(Intercept)", ]) / coefs["date_of_bloods", ])

Giving:

coefs
                     A1          B2           G3
(Intercept)    6998.650 4263.381995  953.8239437
date_of_bloods   -3.450   -2.104623   -0.4612676
year_for_15    2024.246 2018.595376 2035.3129771

I changed your code for the year to make it an integer, that way the coefLM function can work in the normal dependent variable ~ independent variable order. Then it's just y = mx + b implies x0 = (15 - b) / m.