Search code examples
rdplyrdata-sciencedata-analysisdata-wrangling

How to calculate the pre and post values ​each time an event appears on a dataframe in R


I have a database like this example here, and I need to calculate the average, the sum and the standard deviation of the 10 minutes prior to the event and also of the 10 minutes from the event. The numbers in dataframes are just to illustrate.

df = data.frame(seq(ISOdatetime(2020,5,4,0,0,0), ISOdatetime(2020,5,4,0,19,0), by=(60*1)))
colnames(df) = "date"

df$results = c(18, 25, 35, 96, 100, 50, 48, 67, 36, 98, 46, 75, 67, 36, 98, 46, 75, 67, 36, 98)
df$events = c(NA, NA, NA, NA, NA, NA, "event A", NA, NA, "event B", NA, "event C", NA, NA, NA, NA, NA, NA, "event D", NA)

I would like to join these calculations in a new data frame, containing, the event name and date, and the values ​​of average, sum and standard deviation in the pre and also in the later moment. Thank you for any help.

Edit: I would like to make the previous values ​​(of sum, average and standard deviation) to the event (that is, the 10 minutes immediately preceding), and the subsequent values ​​from of the event (that is, if the event happened 10:01, I would like to take the "01" + nine minutes ahead. So, I need separate the values in "avg.pre", "sd.pre", "sum.pre" and "avg.pos", "sd.pos" and "sum.pos"

new.df = data.frame(c("event A", "event B", "event C", "event D")); colnames(new.df) = "events"
new.df$date.pre = c("2020-05-04 00:06:00", "2020-05-04 00:09:00", " 2020-05-04 00:11:00", " 2020-05-04 00:18:00")
new.df$avg.pre = c(53.14, 52.77...)
new.df$sd.pre = c(32.72, 31.18,...)
new.df$sum.pre = c(372, 475, 68,...)
new.df$avg.pos = c(61.71, 64.4, 25,...)
new.df$sd.pos = c(23.26, 23.06, 46,...)
new.df$sum.pos = c(617, 644, 68,...)

Solution

  • You can try this approach filter data for 10 mins before and after each event :

    library(dplyr)
    
    purrr::map_df(unique(na.omit(df$evets)), function(x) {
      i <- df$date[df$evets == x & !is.na(df$evets)]
      
      bind_cols(data.frame(events = x,
                           date = i), df %>%
        filter(between(date, i - 10*60, i)) %>%
        summarise(pre.avg = mean(results, na.rm = TRUE), 
                  pre.sd = sd(results, na.rm = TRUE), 
                  pre.sum = sum(results, na.rm = TRUE)),
        df %>%
        filter(between(date, i, i + 10*60)) %>%
        summarise(post.avg = mean(results, na.rm = TRUE), 
                  post.sd = sd(results, na.rm = TRUE), 
                  post.sum = sum(results, na.rm = TRUE)))
    })
    
    # events                date  pre.avg   pre.sd pre.sum post.avg  post.sd post.sum
    #1 event A 2020-05-04 00:06:00 53.14286 32.72323     372 62.90909 22.42969      692
    #2 event B 2020-05-04 00:09:00 57.30000 31.18066     573 67.45455 24.10960      742
    #3 event C 2020-05-04 00:11:00 61.45455 27.30701     676 66.44444 23.47931      598
    #4 event D 2020-05-04 00:18:00 61.81818 23.49391     680 67.00000 43.84062      134