r dplyr tidyverse data-analysis nested-tibble

How can I apply a function to all the rows of every tibble in a list of tibbles?


I have a list of 106 tibbles. Each one contains two columns (date, temperature) with thousands of values.

I tried to write a function that returns, for each tibble, the index of the row at which the temperature has been lower than 8.0 for the fourth time.

The problem is that my code only examines the first row of each tibble.

Here you can see the code:

pos_r = 0;
temp =0; 
posx = vector();
for (i in seq_along(data_sensor)){
  if (temp < 4){
    pos_r = pos_r + 1;
  if (data_sensor[[i]]$Temperature < 8.0){
       temp=temp+1;
} else if (temp == 4){
   posx[i] = pos_r;
   i = i+1;
}
}
}



> [1] NA NA NA NA NA NA  5  6 NA  7  8 NA NA  9 NA NA NA 10 11 NA 12 13 14 NA 15 16 17 18 19 NA
 [31] 20 21 22 NA 23 24 25 26 27 NA 28 NA 29 30 NA 31 32 33 34 NA 35 36 37 38 NA 39 40 41 42 43
 [61] 44 NA 45 NA 46 47 48 49 50 51 52 53 54 55 56 57 58 NA NA NA 59 60 61 NA 62 63 NA 64 65 66
 [91] NA 67 NA NA 68 69 70 71 72 73 74 75 76 77 78 79

How can I process all the rows of every tibble in the list?


Solution

  • Here's one option. The code below uses logical tests to find the index of the row at which the temperature has been below 8 for the fourth time (the four days need not be consecutive). It then uses map_dbl() to apply this method to each data frame in the list.

    library(tidyverse)
    
    # Generate a list of 5 data frames to work with
    set.seed(33)
    dl = replicate(5, tibble(date=seq(as.Date("2021-01-01"), as.Date("2021-02-01"), by="1 day"),
                             temperature = 10 + cumsum(rnorm(length(date), 0, 3))),
                   simplify=FALSE)
    
    # Index of the row of the fourth day with temperature lower than 8
    # Run this on the first data frame in the list
    min(which(cumsum(dl[[1]][["temperature"]] < 8) == 4))
    #> [1] 8
    
    # Run the method on each data frame in the list
    # Note that Inf is returned (with a warning from min()) when a data frame
    # has fewer than four rows with temperature below 8
    idx8 = dl %>% 
      map_dbl(~ min(which(cumsum(.x[["temperature"]] < 8) == 4)))
    
    idx8
    #> [1]   8  29 Inf   7   6
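
    If the Inf value and the accompanying warning are unwanted, one possible variation is to take the first element of which() instead of its minimum, since indexing an empty vector with [1] gives NA. The sketch below is only an illustration: the helper name nth_below and its arguments are hypothetical and not part of the method above.

    ```r
    # Hypothetical helper (an assumption, not from the answer above):
    # row index of the n-th value below `threshold`,
    # or NA when there are fewer than n such values
    nth_below <- function(temp, threshold = 8, n = 4) {
      which(cumsum(temp < threshold) == n)[1]
    }

    # Hand-made vectors that are easy to check by eye
    nth_below(c(9, 7, 6, 8.5, 5, 7.9, 10))  # below 8 at positions 2, 3, 5, 6
    #> [1] 6
    nth_below(c(9, 7, 10, 6))               # only two values below 8
    #> [1] NA
    ```

    With such a helper, map_int(dl, ~ nth_below(.x[["temperature"]])) should return integer indices with NA in place of Inf.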
    

    Here are the individual steps illustrated on the first data frame in the list:

    # Logical vector returning TRUE when temperature is less than 8
    dl[[1]][["temperature"]] < 8
    #>  [1] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE
    #> [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
    #> [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
    
    # Cumulative number of days where temperature was less than 8
    cumsum(dl[[1]][["temperature"]] < 8) 
    #>  [1] 0 0 0 0 1 2 3 4 4 5 6 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7
    
    # Index of rows for which the cumulative number of days where 
    #  temperature was less than 8 is equal to 4
    which(cumsum(dl[[1]][["temperature"]] < 8) == 4)
    #> [1] 8 9
    
    # We want the index of the first row that meets the condition
    min(which(cumsum(dl[[1]][["temperature"]] < 8) == 4))
    #> [1] 8
    

    Extract the indicated row from each data frame, or a row of missing values if no row satisfies the condition, and return the result as a single data frame:

    list(dl, idx8) %>% 
      pmap_dfr(~ { 
        if(is.infinite(.y)) {
          tibble(date=NA, temperature=NA)
        } else {
          .x %>% 
            slice(.y) %>% 
            mutate(row.index=.y) %>% 
            relocate(row.index)
        }
      },
      .id="data.frame")
    #> # A tibble: 5 × 4
    #>   data.frame row.index date       temperature
    #>   <chr>          <dbl> <date>           <dbl>
    #> 1 1                  8 2021-01-08       7.12 
    #> 2 2                 29 2021-01-29      -0.731
    #> 3 3                 NA NA              NA    
    #> 4 4                  7 2021-01-07       6.29 
    #> 5 5                  6 2021-01-06       4.58
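
    For the list in the question, the same idea might look roughly like the sketch below. The original loop iterates over the tibbles rather than over their rows, and if () given a whole column as its condition can only use a single value (R 4.2.0 and later raise an error; earlier versions used the first element with a warning), which would explain why only the first row of each tibble was examined. The data here is a small made-up stand-in for data_sensor, and the column name Temperature is taken from the question's code.

    ```r
    library(purrr)
    library(tibble)

    # Hypothetical stand-in for the question's list of 106 tibbles
    data_sensor <- list(
      tibble(Date = as.Date("2021-01-01") + 0:5,
             Temperature = c(9.0, 7.5, 7.0, 8.5, 6.0, 5.5)),
      tibble(Date = as.Date("2021-01-01") + 0:5,
             Temperature = c(10, 11, 7, 12, 13, 14)),
      tibble(Date = as.Date("2021-01-01") + 0:5,
             Temperature = c(7, 7, 7, 7, 9, 9))
    )

    # For each tibble, scan the whole Temperature column and return the row
    # index of the fourth value below 8.0 (NA if there are fewer than four)
    posx <- map_int(
      data_sensor,
      ~ which(cumsum(.x[["Temperature"]] < 8.0) == 4)[1]
    )
    posx
    #> [1]  6 NA  4
    ```

    The result has one element per tibble, so posx[i] refers to data_sensor[[i]].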