Search code examples
rlinear-regressiondata-modelingprediction

Predicting when an output might happen in time in R


I have a dataset of peoples blood results in R:

        ID Result date    
        A1 80     01/01/2006
        A1 70     01/01/2009
        A1 61     01/01/2010
        A1 30     01/01/2018
        A1 28     01/01/2022
        B2 40     01/01/2006
        B2 30     01/01/2012
        B2 25     01/01/2015
        B2 10     01/01/2020
        G3 28     01/01/2009
        G3 27     01/01/2014
        G3 25     01/01/2013
        G3 24     01/01/2011
        G3 22     01/01/2019

I would like to plot these and then do a linear regression and find the predicted date at which their blood result would hit 15 for each person. In reality there are multiple blood tests per year so I would try and get the median result per year and then plot those and get the predicted year they hit 15.

I can do the median per year and use the lm function along the lines of:

filtered$date_of_bloods <-format(filtered$date_of_bloods,format="%Y")
#split into individual ID groups
a <- with(filtered, split(filtered, list(ID)))

#aggregate median results per year 
medianfunc <- function(y) {aggregate(results ~ date_of_bloods, data = y, median)}
medians <- sapply(a, medianfunc)

# do lm per ID cohort and get slope of lines 
g<- as.data.frame(medians)
coefLM <- function(x) {coef(lm(date_of_bloods ~ results, data = x))}
coefs<- sapply(g, coefLM)

But I am unsure who to extract and get the prediction for when the slope of the line would hit 15.

Any help would be much appreciated


Solution

  • Here's a base R answer:

    filtered <- read.table(text = "ID results date_of_bloods    
                    A1 80     01/01/2006
                    A1 70     01/01/2009
                    A1 61     01/01/2010
                    A1 30     01/01/2018
                    A1 28     01/01/2022
                    B2 40     01/01/2006
                    B2 30     01/01/2012
                    B2 25     01/01/2015
                    B2 10     01/01/2020
                    G3 28     01/01/2009
                    G3 27     01/01/2014
                    G3 25     01/01/2013
                    G3 24     01/01/2011
                    G3 22     01/01/2019", header = TRUE)
    filtered$date_of_bloods <- as.Date(filtered$date_of_bloods, '%m/%d/%Y')
    
    filtered$date_of_bloods <- format(filtered$date_of_bloods,format="%Y") |> 
      as.integer()
    #split into individual ID groups
    a <- with(filtered, split(filtered, list(ID)))
    
    #aggregate median results per year 
    medianfunc <- function(y) {aggregate(results ~ date_of_bloods, data = y, median)}
    medians <- sapply(a, medianfunc)
    
    # do lm per ID cohort and get slope of lines 
    g <- as.data.frame(medians)
    coefLM <- function(x) {coef(lm(results ~ date_of_bloods, data = x))}
    coefs<- sapply(g, coefLM)  
    
    coefs <- 
      rbind(coefs, 
            year_for_15 = (15 - coefs["(Intercept)", ]) / coefs["date_of_bloods", ])
    

    Giving:

    coefs
                         A1          B2           G3
    (Intercept)    6998.650 4263.381995  953.8239437
    date_of_bloods   -3.450   -2.104623   -0.4612676
    year_for_15    2024.246 2018.595376 2035.3129771
    

    I changed your code for the year to make it an integer, that way the coefLM function can work in the normal dependent variable ~ independent variable order. Then it's just y = mx + b implies x0 = (15 - b) / m.