Search code examples
rdatatable

Finding first iteration of a string in a datatable in R


I am pretty new to R so I was trying to figure out how can I do this better. I have a data table that consist of two columns, (Day and Sleepstatus). How can I find the first iteration of Sleeping and Awake based on the column day and mutate another column to indicate when the person start sleep (1st row of sleeping) and stop sleep (1st row of awake). The rest of the duration of sleeping, the column should show N.A.

Day SleepStatus
1 Sleeping
1 Sleeping
1 Sleeping
1 Awake
2 Sleeping
2 Sleeping
2 Sleeping
2 Awake

Desired Output

Day SleepStatus Final Status
1 Sleeping Start Sleep
1 Sleeping NA
1 Sleeping Stop Sleep
1 Awake NA
2 Sleeping Start Sleep
2 Sleeping NA
2 Sleeping Stop Sleep
2 Awake NA

Solution

  • This is one potential solution:

    library(data.table)
    
    dt <- data.table::data.table(
              Day = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
      SleepStatus = c("Sleeping","Sleeping","Sleeping",
                      "Awake","Sleeping","Sleeping","Sleeping","Awake")
    )
    
    dt[, `Final Status` := {ifelse(
      cumsum(SleepStatus != "Sleeping") != shift(cumsum(SleepStatus != "Sleeping"), fill = 0, type = "lag"),
      "Stop Sleep", "Start Sleep")}]
    dt[, `Final Status` := {ifelse(
      `Final Status` == shift(`Final Status`, fill = "NA", type = "lag"),
      NA, `Final Status`)}]
    dt
    #>    Day SleepStatus Final Status
    #> 1:   1    Sleeping  Start Sleep
    #> 2:   1    Sleeping         <NA>
    #> 3:   1    Sleeping         <NA>
    #> 4:   1       Awake   Stop Sleep
    #> 5:   2    Sleeping  Start Sleep
    #> 6:   2    Sleeping         <NA>
    #> 7:   2    Sleeping         <NA>
    #> 8:   2       Awake   Stop Sleep
    

    The code makes a lot more sense if you break it down into smaller chunks. I've done this with tidyverse functions below, as I feel it's easier to understand, but I can change it to data.table syntax if you would like me to.

    library(data.table)
    
    dt <- data.table::data.table(
              Day = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
      SleepStatus = c("Sleeping","Sleeping","Sleeping",
                      "Awake","Sleeping","Sleeping","Sleeping","Awake")
    )
    
    library(tidyverse)
    df <- as.data.frame(dt)
    
    # When the Sleepstatus is not "Sleeping", increment the variable by one
    df2 <- df %>%
      mutate(Sleeping = cumsum(SleepStatus != "Sleeping"))
    df2
    #>   Day SleepStatus Sleeping
    #> 1   1    Sleeping        0
    #> 2   1    Sleeping        0
    #> 3   1    Sleeping        0
    #> 4   1       Awake        1
    #> 5   2    Sleeping        1
    #> 6   2    Sleeping        1
    #> 7   2    Sleeping        1
    #> 8   2       Awake        2
    
    # If the previous value in "Sleeping" is different to the current value,
    # add the "stop sleeping" flag (i.e. show when "Sleeping" changes)
    df3 <- df2 %>%
      mutate(Sleep_label = ifelse(Sleeping != lag(Sleeping, default = 0), "Stop sleeping", "Start sleeping"))
    df3
    #>   Day SleepStatus Sleeping    Sleep_label
    #> 1   1    Sleeping        0 Start sleeping
    #> 2   1    Sleeping        0 Start sleeping
    #> 3   1    Sleeping        0 Start sleeping
    #> 4   1       Awake        1  Stop sleeping
    #> 5   2    Sleeping        1 Start sleeping
    #> 6   2    Sleeping        1 Start sleeping
    #> 7   2    Sleeping        1 Start sleeping
    #> 8   2       Awake        2  Stop sleeping
    
    # Then, if the value in Sleep_label is equal to the previous label,
    # change it to NA
    df4 <- df3 %>%
      mutate(Final_status = ifelse(Sleep_label == lag(Sleep_label, default = "NA"), NA, Sleep_label))
    df4
    #>   Day SleepStatus Sleeping    Sleep_label   Final_status
    #> 1   1    Sleeping        0 Start sleeping Start sleeping
    #> 2   1    Sleeping        0 Start sleeping           <NA>
    #> 3   1    Sleeping        0 Start sleeping           <NA>
    #> 4   1       Awake        1  Stop sleeping  Stop sleeping
    #> 5   2    Sleeping        1 Start sleeping Start sleeping
    #> 6   2    Sleeping        1 Start sleeping           <NA>
    #> 7   2    Sleeping        1 Start sleeping           <NA>
    #> 8   2       Awake        2  Stop sleeping  Stop sleeping
    

    Created on 2022-05-20 by the reprex package (v2.0.1)

    Does that make sense? Or did I just make things more confusing?