I have a database containing the number of cells in different growth stages (enlarging and maturing). I only have data for certain days of the year (DOY). Sometimes there were no cells, and the next time I sampled there were already more than 2 cells. I want to know when each growth stage started (Enlarging/Maturing > 1) and ended (Enlarging/Maturing < 1). For that, I want to interpolate between the two consecutive DOYs when there were no cells and when there were more than 2. This interpolation needs to be on a daily scale (DOY 1-365) so I know the exact day on which the onset and ending of each growth stage took place. The database looks like this (simplified reproducible example):
library(dplyr)
df <- data.frame("Year" = c(2012, 2012, 2012, 2012, 2012, 2012, 2012,
                            2012, 2012, 2012, 2013, 2013, 2013,
                            2013, 2013, 2013, 2013, 2013, 2013, 2013),
                 "Tree" = c(15, 15, 15, 15, 15, 22, 22, 22, 22, 22, 41, 41,
                            41, 41, 41, 53, 53, 53, 53, 53),
                 "DOY" = c(65, 97, 125, 177, 214, 65, 97, 125, 177, 214,
                           61, 99, 118, 166, 221, 61, 99, 118, 166, 221),
                 "Enlarging" = c(0, 2, 4, 5, 0, 0, 3, 6, 3, 0, 0, 5, 4, 4, 0, 0, 4, 7, 5, 0),
                 "Maturing" = c(0, 0, 3, 7, 0, 0, 0, 3, 4, 0, 0, 3, 6, 8, 0, 0, 0, 4, 7, 0))
df <- df %>%
  mutate(Year = as.factor(Year),
         Tree = as.factor(Tree),
         DOY = as.numeric(DOY),
         Enlarging = as.numeric(Enlarging),
         Maturing = as.numeric(Maturing))
print(df)
   Year Tree DOY Enlarging Maturing
1  2012   15  65         0        0
2  2012   15  97         2        0
3  2012   15 125         4        3
4  2012   15 177         5        7
5  2012   15 214         0        0
6  2012   22  65         0        0
7  2012   22  97         3        0
8  2012   22 125         6        3
9  2012   22 177         3        4
10 2012   22 214         0        0
11 2013   41  61         0        0
12 2013   41  99         5        3
13 2013   41 118         4        6
14 2013   41 166         4        8
15 2013   41 221         0        0
16 2013   53  61         0        0
17 2013   53  99         4        0
18 2013   53 118         7        4
19 2013   53 166         5        7
20 2013   53 221         0        0
I was planning to use the approxfun or approx functions to do that because I saw they can do what I need, but I have never tried them. The problem is that I have 2 growth stages, several trees and several sampling years, so I wanted to use the dplyr package (something like group_by(Year, Tree)) because it's easy for me to use. But I have no idea how to write the syntax for what I want to do. I don't mind using another approach, like a linear model (lm) + predict to predict the onset (cells > 1) and ending (cells < 1) of every growth stage, but again, I don't know how to do it. I would like to obtain a new data frame similar to the one above but with the interpolated data, or a data frame containing the onset and ending DOYs for every growth stage for every Tree, every Year.
Thank you so much in advance if anyone can help me.
Maybe this will work for you. The method is to split the original data frame by year and tree into a list of data frames. Then, for each one, take the DOY and the Enlarging/Maturing columns and interpolate to find the day of year when Enlarging/Maturing = 1. Below I am using a linear approximation, with a cubic spline as a commented-out alternative. Of course, this assumes valid incoming data: no repeated measurements, data in chronological order, etc.
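To make the interpolation step concrete, here is a small sketch for a single pair of samplings: tree 15 in 2012, where Enlarging goes from 0 cells on DOY 65 to 2 cells on DOY 97. It assumes the cell count changes linearly between the two samplings. The variable names (cells, doy, onset_doy, daily) are only illustrative.

```r
# Two consecutive samplings for tree 15 (2012), taken from the question's data
cells <- c(0, 2)    # Enlarging counts
doy   <- c(65, 97)  # sampling days

# Treat DOY as a function of the cell count and read it off at 1 cell
onset_doy <- approx(x = cells, y = doy, xout = 1)$y
onset_doy

# The reverse direction gives a daily series of interpolated counts
daily <- approx(x = doy, y = cells, xout = 65:97)
head(data.frame(DOY = daily$x, Enlarging = daily$y))
```

Swapping the roles of x and y in approx() is what turns "count on a given day" into "day for a given count"; the second call is the daily-scale interpolation asked about in the question, restricted to this one interval.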
library(dplyr)
onset <- function(x, doy) {
  #Looking for the last 0 day before onset and the first nonzero day after it.
  first <- which(x != 0)[1]
  #no onset can be bracketed if there are no cells or no preceding sample
  if (is.na(first) || first == 1) return(NA_real_)
  #keep only these two samples: approx() sorts by x and averages tied counts,
  #so later samples with a repeated count would shift the estimate
  keep <- c(first - 1, first)
  x <- x[keep]
  doy <- doy[keep]
  #approximation options
  #linear
  estimate <- approx(x, doy, xout=1)$y
  #cubic spline
  #estimate <- spline(x, doy, xout=1)$y
  return (estimate)
}
#split year and tree into a list of data frames
treelist <- split(df, paste(df$Year, df$Tree))
dfs <- lapply(treelist, function(tree){
  EnlargeOnset <- onset(tree$Enlarging, tree$DOY)
  MatureOnset <- onset(tree$Maturing, tree$DOY)
  data.frame(Year=tree$Year[1], Tree=tree$Tree[1], EnlargeOnset, MatureOnset)
})
answers <- bind_rows(dfs)
answers
  Year Tree EnlargeOnset MatureOnset
1 2012   15     81.00000   106.33333
2 2012   22     75.66667   106.33333
3 2013   41     68.60000    73.66667
4 2013   53     70.50000   103.75000
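The question also asks for the end of each stage and mentions group_by(Year, Tree). Here is a minimal sketch of one way that could look, not a definitive implementation. The helper crossing_doy() and the result name stage_dates are hypothetical names introduced here. The sketch assumes a stage counts as active from 1 cell upwards, that counts change linearly between the two samplings that bracket the crossing, and that each tree has a sample below the threshold before onset and after the end. The example data are rebuilt so the block runs on its own.

```r
library(dplyr)

# Same example data as in the question, rebuilt so the block is self-contained
df <- data.frame(
  Year = rep(c(2012, 2013), each = 10),
  Tree = rep(c(15, 22, 41, 53), each = 5),
  DOY  = c(rep(c(65, 97, 125, 177, 214), 2), rep(c(61, 99, 118, 166, 221), 2)),
  Enlarging = c(0, 2, 4, 5, 0, 0, 3, 6, 3, 0, 0, 5, 4, 4, 0, 0, 4, 7, 5, 0),
  Maturing  = c(0, 0, 3, 7, 0, 0, 0, 3, 4, 0, 0, 3, 6, 8, 0, 0, 0, 4, 7, 0)
)

# Hypothetical helper: DOY at which the count crosses `threshold`, using a
# straight line between the two samples on either side of the crossing.
# type = "onset" uses the first sample at/above the threshold and the one
# before it; type = "end" uses the last such sample and the one after it.
crossing_doy <- function(x, doy, threshold = 1, type = c("onset", "end")) {
  type <- match.arg(type)
  above <- which(x >= threshold)
  if (length(above) == 0) return(NA_real_)       # stage never observed
  if (type == "onset") {
    i <- above[1]
    j <- i - 1
  } else {
    i <- above[length(above)]
    j <- i + 1
  }
  if (j < 1 || j > length(x)) return(NA_real_)   # crossing is not bracketed
  # x[j] is below the threshold and x[i] is not, so the slope is never zero
  doy[j] + (threshold - x[j]) / (x[i] - x[j]) * (doy[i] - doy[j])
}

stage_dates <- df %>%
  arrange(Year, Tree, DOY) %>%                   # samples must be in date order
  group_by(Year, Tree) %>%
  summarise(
    EnlargeOnset = crossing_doy(Enlarging, DOY, type = "onset"),
    EnlargeEnd   = crossing_doy(Enlarging, DOY, type = "end"),
    MatureOnset  = crossing_doy(Maturing, DOY, type = "onset"),
    MatureEnd    = crossing_doy(Maturing, DOY, type = "end"),
    .groups = "drop"
  )
print(stage_dates)
```

Because only the two samples around each crossing are used, repeated or decreasing counts later in the season do not affect the onset estimate. Rounding the results (for example with ceiling()) would give whole days if that is preferred.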