Search code examples
rdplyrsurvival-analysisdata-transform

Seed germination data: converting time to event data from short form to long form for survival analysis


I am evaluating seedling emergence rates using survival analysis and I would like to automate the process of converting the short form collected data into the long form for analysis in R.

Here is an example of the collected data format and the date conversion:

prac.dat <- tribble(
  ~ID, ~ImbibtionStartDate, ~Survey1date, ~Survey1totalcounts, ~Survey2date, ~Survey2totalcounts,~Survey3date, ~Survey3totalcounts, ~Total_sown_seeds,
  #--/--------------------/-------------/--------------------/-------------/------------------/---------------/------------------/-----------------/
  "ID1", "3/22/2022 14:20","3/24/2022 16:45", 0, "3/25/2022 16:00", 8, "3/26/2022 13:00", 21, 25,
  "ID2", "3/22/2022 14:20","3/24/2022 16:45", 1, "3/25/2022 16:00", 4, "3/26/2022 13:00", 11, 25,
)

prac.dat <- prac.dat %>% 
  mutate(ImbibtionStartDate=as.POSIXct(ImbibtionStartDate, format="%m/%d/%Y %H:%M"),
         Survey1date=as.POSIXct(Survey1date, format="%m/%d/%Y %H:%M"),
         Survey2date=as.POSIXct(Survey2date, format="%m/%d/%Y %H:%M"),
         Survey3date=as.POSIXct(Survey3date, format="%m/%d/%Y %H:%M"))

In this data set, "ID" is the identity of the pot where seeds were sown, "ImbibtionStartDate" is the date and time when seeds in the soil were first watered, "Survey1date" [and other survey date columns] are the date and time a survey was conducted to count total seedling emergents, "Survey1totalcounts" [and other survey count columns] indicate the cumulative number of seedlings that have emerged in that pot by that survey date, and "Total_sown_seeds" indicates the total number of seeds that were sown in a pot.

I aiming for a data set that 1) generates a row for every seed in every pot (pot identification is represented by the "ID" column), 2) indicates whether the seed emerged ("1") or did not emerge ("0") over the course of the study period, and 3) calculates the specific time it took for each seed to emerge (estimated by difference between the Survey date and time when the seedling was first spotted and the imbibtion start date and time).

I would like the final output to look something like this:

desired.output <- tribble(
  ~ID, ~Emg_Poa, ~time_to_emg,
  #Unique Id for each Seed/
  #whether that seed emerged ("1") or not ("0") by the final survey date/
  #days it took for that seed to emerge from imbibtion start to survey date/
  "ID1",1, 3.07, "ID1",1, 3.07, "ID1",1, 3.07, "ID1",1, 3.07, "ID1",1, 3.07, "ID1",1, 3.07, "ID1",1, 3.07, "ID1",1, 3.07,
  "ID1",1, 3.94, "ID1",1, 3.94, "ID1",1, 3.94, "ID1",1, 3.94, "ID1",1, 3.94, "ID1",1, 3.94, "ID1",1, 3.94, "ID1",1, 3.94,
  "ID1",1, 3.94, "ID1",1, 3.94, "ID1",1, 3.94, "ID1",1, 3.94, "ID1",1, 3.94, "ID1",0, NA, "ID1",0, NA, "ID1",0, NA, "ID1",0, NA,
  "ID2",1, 2.10, "ID2",1, 3.07, "ID2",1, 3.07, "ID2",1, 3.07, "ID2",1, 3.94, "ID2",1, 3.94, "ID2",1, 3.94, "ID2",1, 3.94,
  "ID2",1, 3.94, "ID2",1, 3.94, "ID2",1, 3.94, "ID2",0, NA, "ID2",0, NA,"ID2",0, NA,"ID2",0, NA,"ID2",0, NA,"ID2",0, NA,
  "ID2",0, NA,"ID2",0, NA,"ID2",0, NA,"ID2",0, NA,"ID2",0, NA, "ID2",0, NA, "ID2",0, NA, "ID2",0, NA
  )

To date, I've done these conversions by hand from one excel into another, but in the interest of minimizing errors and saving time, I am curious if anyone would be willing to propose a method of automating this process in R. This task is beyond my current functional capacity in R data frame generation. Thank you for your time, consideration, and input.


Solution

  • Getting from prac.dat to your desired output is a bit tricky, but certainly possible. First, let's get prac.dat into "long" format and calculate a few useful columns:

    prac.long <- prac.dat %>% 
      pivot_longer(matches('counts|Survey.*date'), names_to = c('survey_num', '.value'), names_pattern = 'Survey(\\d)(.*)') %>% 
      rename(survey_date = date, count = totalcounts) %>% 
      group_by(ID) %>% 
      mutate(
        across(c(ImbibtionStartDate, survey_date), ~as.POSIXct(., format="%m/%d/%Y %H:%M")),
        not_emerged = Total_sown_seeds - max(count),
        time_to_emerge = survey_date - ImbibtionStartDate,
        emerged_at_survey = count - lag(count),
        emerged_at_survey = ifelse(is.na(emerged_at_survey), count[1], emerged_at_survey)
      ) 
    
      ID    ImbibtionStartDate  Total_sown_seeds survey_num survey_date         count not_emerged
      <chr> <dttm>                         <dbl> <chr>      <dttm>              <dbl>       <dbl>
    1 ID1   2022-03-22 14:20:00               25 1          2022-03-24 16:45:00     0           4
    2 ID1   2022-03-22 14:20:00               25 2          2022-03-25 16:00:00     8           4
    3 ID1   2022-03-22 14:20:00               25 3          2022-03-26 13:00:00    21           4
    4 ID2   2022-03-22 14:20:00               25 1          2022-03-24 16:45:00     1          14
    5 ID2   2022-03-22 14:20:00               25 2          2022-03-25 16:00:00     4          14
    6 ID2   2022-03-22 14:20:00               25 3          2022-03-26 13:00:00    11          14
    # … with 2 more variables: time_to_emerge <drtn>, emerged_at_survey <dbl>
    

    We also need to calculate counts of seeds that did not emerge:

    prac.unemerged <- select(prac.long, ID, not_emerged) %>% 
      distinct %>% 
      mutate(time_to_emerge = NA) %>% 
      rename(count = not_emerged)
    
      ID    count time_to_emerge
      <chr> <dbl> <lgl>         
    1 ID1       4 NA            
    2 ID2      14 NA  
    

    Finally, we combine the counts of emerged seeds and their time to germination with the data.unemerged, and use uncount to expand to your desired output:

    result <- select(prac.long, ID, time_to_emerge, count = emerged_at_survey) %>% 
      bind_rows(prac.unemerged) %>% 
      uncount(weights = count) %>% 
      mutate(Emg_poa = as.numeric(!is.na(time_to_emerge))) %>% 
      arrange(ID, time_to_emerge)
    
       ID    time_to_emerge Emg_poa
       <chr> <drtn>           <dbl>
     1 ID1   3.069444 days        1
     2 ID1   3.069444 days        1
     3 ID1   3.069444 days        1
     4 ID1   3.069444 days        1
     5 ID1   3.069444 days        1
     6 ID1   3.069444 days        1
     7 ID1   3.069444 days        1
     8 ID1   3.069444 days        1
     9 ID1   3.944444 days        1
    10 ID1   3.944444 days        1