r dplyr group-by interpolation lm

Combine dplyr group_by and approxfun/approx in R


I have a database containing the number of cells in different growth stages (enlarging and maturing). I only have data for certain days of the year (DOY). Sometimes there were no cells on one sampling date and already more than 2 cells on the next. I want to know when each growth stage started (Enlarging/Maturing > 1) and ended (Enlarging/Maturing < 1). To do that, I want to interpolate between the two consecutive DOYs where there were no cells and where there were more than 2. The interpolation needs to be on a daily scale (DOY 1-365) so that I know the exact day on which each growth stage started and ended. The database looks like this (simplified reproducible example):

df <- data.frame("Year" = c(2012, 2012, 2012, 2012, 2012, 2012, 2012,
                            2012, 2012, 2012, 2013, 2013, 2013,
                            2013, 2013, 2013, 2013, 2013, 2013, 2013),
                 "Tree" = c(15, 15, 15, 15, 15, 22, 22, 22, 22, 22, 41, 41,
                            41, 41, 41, 53, 53, 53, 53, 53),
                 "DOY" = c(65, 97, 125, 177, 214, 65, 97, 125, 177, 214,
                           61, 99, 118, 166, 221, 61, 99, 118, 166, 221),
                 "Enlarging" = c(0, 2, 4, 5, 0, 0, 3, 6, 3, 0, 0, 5, 4, 4, 0, 0, 4, 7, 5, 0),
                 "Maturing" = c(0, 0, 3, 7, 0, 0, 0, 3, 4, 0, 0, 3, 6, 8, 0, 0, 0, 4, 7, 0))

library(dplyr)  # provides %>% and mutate()

df <- df %>%
  mutate(Year = as.factor(Year),
         Tree = as.factor(Tree),
         DOY = as.numeric(DOY),
         Enlarging = as.numeric(Enlarging),
         Maturing = as.numeric(Maturing))

print(df)
   Year Tree DOY Enlarging Maturing
1  2012   15  65         0        0
2  2012   15  97         2        0
3  2012   15 125         4        3
4  2012   15 177         5        7
5  2012   15 214         0        0
6  2012   22  65         0        0
7  2012   22  97         3        0
8  2012   22 125         6        3
9  2012   22 177         3        4
10 2012   22 214         0        0
11 2013   41  61         0        0
12 2013   41  99         5        3
13 2013   41 118         4        6
14 2013   41 166         4        8
15 2013   41 221         0        0
16 2013   53  61         0        0
17 2013   53  99         4        0
18 2013   53 118         7        4
19 2013   53 166         5        7
20 2013   53 221         0        0

I was planning to use the approxfun or approx functions for this because I saw they can do what I need, but I have never used them. The problem is that I have 2 growth stages, several trees and several sampling years, so I wanted to use the dplyr package (something like group_by(Year, Tree)) because it is easy for me to use. But I have no idea how to write the syntax for what I want to do. I don't mind using another approach, such as a linear model (lm) + predict to estimate the onset (cells > 1) and ending (cells < 1) of every growth stage, but again, I don't know how to do it. I would like to obtain either a new data frame similar to the one above but with the interpolated data, or a data frame containing the onset and ending DOYs of every growth stage for every Tree and every Year.

Thank you so much in advance if anyone can help me.


Solution

  • Maybe this will work for you. The method is to split the original data frame by year and tree into a list of data frames. Then, using the DOY and the Enlarging/Maturing columns, interpolate to find the day of year at which Enlarging/Maturing = 1 (the onset threshold). Below I use a linear interpolation; a cubic spline alternative is commented out. Of course, this assumes valid incoming data: no repeated measurements, data in chronological order, etc.

    library(dplyr)
    
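    onset <- function(x, doy) {
        # Keep the last zero sample before onset plus all nonzero samples.
        nonzero <- which(x != 0)
        nonzero <- c(nonzero[1] - 1, nonzero)  # prepend the index of the last zero

        x <- x[nonzero]
        doy <- doy[nonzero]
        # Approximation options. Note that approx() and spline() collapse
        # repeated x values by averaging their doy (ties = mean), which can
        # shift the estimate when a cell count occurs more than once.
        # linear
        estimate <- approx(x, doy, xout = 1)$y
        # cubic spline
        # estimate <- spline(x, doy, xout = 1)$y
        return(estimate)
    }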
    
    # split by year and tree into a list of data frames
    treelist <- split(df, paste(df$Year, df$Tree))

    dfs <- lapply(treelist, function(tree) {
        EnlargeOnset <- onset(tree$Enlarging, tree$DOY)
        MatureOnset <- onset(tree$Maturing, tree$DOY)

        # one row per year and tree
        data.frame(Year = tree$Year[1], Tree = tree$Tree[1], EnlargeOnset, MatureOnset)
    })
    answers <- bind_rows(dfs)
     
    #answers
      Year Tree EnlargeOnset MatureOnset
    1 2012   15        81.00   106.33333
    2 2012   22        89.00   106.33333
    3 2013   41        81.25    73.66667
    4 2013   53        70.50   103.75000
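
Since the question asks for group_by(Year, Tree) and for both the onset and the ending, here is a minimal sketch of how that might look with dplyr. It is not the code above: `stage_crossing()` and its `threshold` and `edge` arguments are hypothetical names introduced for illustration, and it assumes the threshold of 1 cell from the question and one rise and one fall per tree and year. It interpolates only between the two samples that bracket the threshold, because approx() averages the y values of tied x values by default (ties = mean), so estimates can differ from an interpolation over the whole nonzero series whenever a cell count repeats.

```r
library(dplyr)

# Example data from the question (Year and Tree kept numeric for brevity).
df <- data.frame(
  Year = rep(c(2012, 2013), each = 10),
  Tree = rep(c(15, 22, 41, 53), each = 5),
  DOY = c(65, 97, 125, 177, 214, 65, 97, 125, 177, 214,
          61, 99, 118, 166, 221, 61, 99, 118, 166, 221),
  Enlarging = c(0, 2, 4, 5, 0, 0, 3, 6, 3, 0, 0, 5, 4, 4, 0, 0, 4, 7, 5, 0),
  Maturing = c(0, 0, 3, 7, 0, 0, 0, 3, 4, 0, 0, 3, 6, 8, 0, 0, 0, 4, 7, 0)
)

# Hypothetical helper: DOY at which the cell count crosses `threshold`,
# by linear interpolation between the two samples around the crossing.
# edge = "onset" uses the first sample at or above the threshold and the
# sample before it; edge = "end" uses the last such sample and the one after.
stage_crossing <- function(cells, doy, threshold = 1, edge = c("onset", "end")) {
  edge <- match.arg(edge)
  above <- cells >= threshold
  n <- length(cells)
  if (edge == "onset") {
    i <- match(TRUE, above)                 # first sample at or above threshold
    if (is.na(i) || i == 1) return(NA_real_)  # no sample before the crossing
    pair <- c(i - 1, i)
  } else {
    i <- n + 1 - match(TRUE, rev(above))    # last sample at or above threshold
    if (is.na(i) || i == n) return(NA_real_)  # no sample after the crossing
    pair <- c(i, i + 1)
  }
  # Only two points are passed, so no ties can be averaged away.
  approx(x = cells[pair], y = doy[pair], xout = threshold)$y
}

stages <- df %>%
  arrange(Year, Tree, DOY) %>%              # samples must be in chronological order
  group_by(Year, Tree) %>%
  summarise(
    EnlargeOnset = stage_crossing(Enlarging, DOY, edge = "onset"),
    EnlargeEnd   = stage_crossing(Enlarging, DOY, edge = "end"),
    MatureOnset  = stage_crossing(Maturing, DOY, edge = "onset"),
    MatureEnd    = stage_crossing(Maturing, DOY, edge = "end"),
    .groups = "drop"
  )

print(stages)
```

The result has one row per Year and Tree. A group whose first or last sample is already at or above the threshold gets NA for the corresponding onset or end, since there is nothing to interpolate against.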