Search code examples
rgroup-bymodelpredictiongam

How to apply a GAM function by group for every individual


I have a tree growth database containing the number of cells in every growth stage (enlarging, thickening, mature) for every DOY (Day of the year) for every tree and every year (6 different trees per year, 10 years, 60 trees in total). The database looks like this (simplified):

 Year Tree DOY Enlarging Thickening Mature
  2012  25  80    0         0          0
  2012  25  87    1         0          0
  2012  25  94    4         0          0
  2012  25  103   5         1          0
  2012  25  111   3         3          0
  2012  25  119   1         4          1
  2012  25  127   1         5          3
  2012  30  80    0         0          0
  2012  30  87    2         0          0
  2012  30  94    5         1          0
  2012  30  103   7         3          1
  2012  30  111   4         6          2
  2012  30  119   3         7          5
  2012  30  127   1         8          7
  2012  43  80    0         0          0
  2012  43  87    0         0          0
  2012  43  94    2         0          0
  etc.

I would like to apply a GAM function to obtain predictions about when does every growth stage start and finish for every tree, every year, and also understand the growth curves and pattern every stage follows. The model I use it's just a simple GAM between every growth phase number of cells (enlarging, thickening, mature) and the day of the year it occurs:

Enlarging <- gam(Enlarging ~ s(DOY), data=datosSTD, quasipoisson, gamma=1, min.sp=0.01)  
Thickening <- gam(Thickening ~ s(DOY), data=datosSTD, quasipoisson, gamma=1, min.sp=0.01)  
Mature <-gam(Mature ~ s(DOY), data=datosSTD, quasipoisson, gamma=1, min.sp=0.01)

My problem is that I am unable to apply this GAM model to every individual tree, or even apply it for the yearly average of trees. For example, I tried using dplyr with the enlarging phase:


    Enlarging <- df %>%
      group_by(Tree, Year)%>%
      do(gam_enlarging = gam(Enlarging ~ s(DOY), data = ., quasipoisson, gamma = 1, min.sp = 0.01))%>%
      ungroup

which gives me a list with the coefficients, residuals and fitted values among other data. But when I try to obtain the fitted values of the model for every tree I get an error:

fitted.enlarging <- data.frame(c(sapply(Enlarging$gam_enlarging, fitted)))

Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : arguments imply differing number of rows: 19, 18, 23, 15, 26, 22, 29, 21

I tried using the function predict, but I get a different error:

predicted.enlarging <- as.vector(predict(Enlarging,  data.frame(DOY),type="response"))

Error in UseMethod("predict") : no applicable method for 'predict' applied to an object of class "c('tbl_df', 'tbl', 'data.frame')"

I was thinking about using a loop, but honestly I don't think I have the knowledge to do that yet. I'm still learning basic R. I just need to know when does every growth phase start (nº Enlarging cells>1) and ends (nº Enlarging cells<0) for every tree, every year, and be able to plot a representation of the growth stage curve pattern. If possible, I would like to use the dplyr package to group my data and process it since it's the package I'm more familiar with.


Solution

  • A set of dataframe rearrangements with {dplyr} and {tidyr} allow you to group your data, nest them (i. e. populate a dataframe 'cell' with subsets of data), apply arbitrary functions (like gam), and finally unnest the thing again.

    It's probably best explained with an example. Using your data (in that case only one combination of Tree and Year, but use it on your whole data as desired):

    your example data:

    df <- structure(list(Year = c(2012L, 2012L, 2012L, 2012L, 2012L, 2012L, 
    2012L), Tree = c(25L, 25L, 25L, 25L, 25L, 25L, 25L), DOY = c(80L, 
    87L, 94L, 103L, 111L, 119L, 127L), Enlarging = c(0L, 1L, 4L, 
    5L, 3L, 1L, 1L), Thickening = c(0L, 0L, 0L, 1L, 3L, 4L, 5L), 
        Mature = c(0L, 0L, 0L, 0L, 0L, 1L, 3L)), class = "data.frame", row.names = c(NA, 
    7L))
    

    required libraries:

    library(dplyr)
    library(tidyr)
    library(gam)
    
    • reshape the data from wide to long format, i. e. stack Count by growth status groups:
    df_long <- 
      df |>
      pivot_longer(Enlarging:Mature,
                   names_to = 'Growth_Phase',
                   values_to = 'Count'
                   )
    
    #> df_long |> head(3)
    # A tibble: 3 x 5
       Year  Tree   DOY Growth_Phase Count
      <int> <int> <int> <chr>        <int>
    1  2012    25    80 Enlarging        0
    2  2012    25    80 Thickening       0
    3  2012    25    80 Mature           0
    
    • nest by Year, Tree and Growth_Phase:
    df_nested <- 
      df_long |>
      nest_by(Year, Tree, Growth_Phase)
    

    column data of the nested dataframe contains one dataframe per row (= unique combination of Year x Tree x Growth_Phase):

    # A tibble: 3 x 4
    # Rowwise:  Year, Tree, Growth_Phase
       Year  Tree Growth_Phase               data
      <int> <int> <chr>        <list<tibble[,2]>>
    1  2012    25 Enlarging               [7 x 2]
    2  2012    25 Mature                  [7 x 2]
    3  2012    25 Thickening              [7 x 2]
    
    • fit a GAM per combination (remember to return the GAM or any other not atomic result as a list, so it can be accomodated by a single cell of the main dataframe):
    df_with_gam <- 
      df_nested |>
      rowwise() |>
      mutate(the_gam = list(
               gam(Count ~ s(DOY), 
                   data = data, quasipoisson, gamma = 1, min.sp = 0.01)  
             ),
             Fitted = list(predict(the_gam, type = 'response'))
             )
    
    > df_with_gam
    # A tibble: 3 x 6
    # Rowwise: 
       Year  Tree Growth_Phase               data the_gam Fitted   
      <int> <int> <chr>        <list<tibble[,2]>> <list>  <list>   
    1  2012    25 Enlarging               [7 x 2] <Gam>   <dbl [7]>
    2  2012    25 Mature                  [7 x 2] <Gam>   <dbl [7]>
    3  2012    25 Thickening              [7 x 2] <Gam>   <dbl [7]>
    
    • select and unnest the desired columns again:
    df_with_gam |>
      select(Year:Growth_Phase, data, Fitted) |>
      unnest_longer(c('data', 'Fitted')) |>
      unnest_wider(data)
    
    + # A tibble: 21 x 7
        Year  Tree Growth_Phase   DOY Count   Fitted Fitted_id
       <int> <int> <chr>        <int> <int>    <dbl> <chr>    
     1  2012    25 Enlarging       80     0 1.30e- 1 1        
     2  2012    25 Enlarging       87     1 9.71e- 1 2        
     3  2012    25 Enlarging       94     4 3.79e+ 0 3        
     4  2012    25 Enlarging      103     5 5.07e+ 0 4        
     5  2012    25 Enlarging      111     3 2.87e+ 0 5        
     6  2012    25 Enlarging      119     1 1.26e+ 0 6        
     7  2012    25 Enlarging      127     1 8.95e- 1 7        
     8  2012    25 Mature          80     0 5.94e-11 1        
     9  2012    25 Mature          87     0 3.23e-13 2        
    10  2012    25 Mature          94     0 4.53e-15 3        
    # ... with 11 more rows
    
    • filter, pivot_wider, arrange (= sort) etc. as desired ...

    • quick control, df_result being your resulting dataframe after above manipulations:

    library(ggplot2)
    
    df_result |>
      ggplot() +
         geom_point(aes(Count, Fitted)) +
         facet_wrap(~ Growth_Phase)
    

    enter image description here