r, pattern-recognition

How to compile the number of observations that follow a specific pattern?


I have a dataset with three variables (DateTime, Transmitter, and timediff). The timediff column is the time difference between subsequent detections of a transmitter. I want to know how many times the time differences followed a specific pattern. Here is a sample of my data.

> dput(Example)
structure(list(DateTime = structure(c(1501117802, 1501117805, 
1501117853, 1501117857, 1501117913, 1501117917, 1501186253, 1501186254, 
1501186363, 1501186365, 1501186541, 1501186542, 1501186550, 1501186590, 
1501186591, 1501186644, 1501186646, 1501186737, 1501186739, 1501187151
), class = c("POSIXct", "POSIXt"), tzone = "GMT"), Transmitter = c(30767L, 
30767L, 30767L, 30767L, 30767L, 30767L, 30767L, 30767L, 30767L, 
30767L, 30767L, 30767L, 30767L, 30767L, 30767L, 30767L, 30767L, 
30767L, 30767L, 30767L), timediff = c(44, 3, 48, 4, 56, 4, 50, 
1, 42, 2, 56, 1, 8, 40, 1, 53, 2, 37, 2, 42)), row.names = c(NA, 
20L), class = "data.frame")

Looking at the timediff column, I want to know how many times there is a single timediff < 8 seconds, how many times there are two consecutive timediffs < 8 seconds, how many times there are three consecutive timediffs < 8 seconds, and so on.

Example: In the given dataset, a single timediff < 8 seconds happens 7 times while two consecutive timediffs < 8 seconds happen twice.

A "single timediff" = 44, 3, 48

A "double timediff" = 56, 1, 8, 40

In terms of an output, I'd be looking for something like this...

> dput(output)
structure(list(ID = 30767, Single = 7, Double = 2), class = "data.frame", row.names = c(NA, 
-1L))

Thanks for the help!


Solution

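  • One dplyr possibility is shown below, where df is the data frame from the question. It tests timediff <= 8 rather than < 8, so that the values 1, 8 in the question's "double timediff" example count as two consecutive short differences: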

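    library(dplyr)

    df %>%
      # TRUE where the time difference is short (<= 8 s)
      mutate(cond = timediff <= 8) %>%
      # Give every run of consecutive equal cond values its own id
      group_by(rleid = with(rle(cond), rep(seq_along(lengths), lengths))) %>%
      # Attach the run length to each row of the run
      add_count(rleid, name = "n_timediff") %>%
      # Keep one row per run of short time differences
      filter(cond & row_number() == 1) %>%
      ungroup() %>%
      # Count how many runs there are of each length
      count(n_timediff)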
    
    n_timediff     n
           <int> <int>
    1          1     8
    2          2     1
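
    To make the run-length idea behind the pipeline above easier to follow, here is a minimal base R sketch. It assumes the same <= 8 threshold as the dplyr code and works on the timediff values copied from the question. The names runs, short_runs and run_counts are introduced only for this illustration.

    ```r
    # timediff values copied from the question's dput() output
    timediff <- c(44, 3, 48, 4, 56, 4, 50, 1, 42, 2,
                  56, 1, 8, 40, 1, 53, 2, 37, 2, 42)

    # rle() collapses consecutive equal values into runs;
    # here the values are TRUE/FALSE flags for "short" time differences
    runs <- rle(timediff <= 8)

    # Keep only the lengths of the runs where the flag is TRUE
    short_runs <- runs$lengths[runs$values]

    # Tabulate how many runs there are of each length
    run_counts <- table(short_runs)
    print(run_counts)
    ```

    On the sample data this agrees with the dplyr result: eight runs of length one and one run of length two.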
    

    If "Transmitter" can hold more than one value, you can group by it as well and spread the counts into one column per run length (this also requires tidyr):

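    library(dplyr)
    library(tidyr)

    df %>%
      mutate(cond = timediff <= 8) %>%
      # Grouping by Transmitter as well keeps runs from different transmitters apart
      group_by(Transmitter, rleid = with(rle(cond), rep(seq_along(lengths), lengths))) %>%
      add_count(rleid, name = "n_timediff") %>%
      filter(cond & row_number() == 1) %>%
      ungroup() %>%
      group_by(Transmitter) %>%
      # Number of runs of each length, per transmitter
      count(n_timediff) %>%
      mutate(n_timediff = paste("timediff", n_timediff, sep = "_")) %>%
      # One column per run length
      spread(n_timediff, n)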
    
      Transmitter timediff_1 timediff_2
            <int>      <int>      <int>
    1       30767          8          1
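
    Since spread() has been superseded by pivot_wider() in tidyr 1.0.0 and later, a variant of the same idea could look like the sketch below. It is not the answer's original code: it rebuilds only the Transmitter and timediff columns from the question, computes the run ids inside each transmitter group, and assumes the same <= 8 threshold. The name result is introduced only for this illustration.

    ```r
    library(dplyr)
    library(tidyr)

    # Transmitter and timediff columns copied from the question's sample data
    df <- data.frame(
      Transmitter = rep(30767L, 20),
      timediff = c(44, 3, 48, 4, 56, 4, 50, 1, 42, 2,
                   56, 1, 8, 40, 1, 53, 2, 37, 2, 42)
    )

    result <- df %>%
      group_by(Transmitter) %>%
      # Flag short time differences and number the runs within each transmitter
      mutate(cond = timediff <= 8,
             rleid = with(rle(cond), rep(seq_along(lengths), lengths))) %>%
      ungroup() %>%
      # Keep only the rows that belong to runs of short time differences
      filter(cond) %>%
      # Length of each run
      count(Transmitter, rleid, name = "n_timediff") %>%
      # Number of runs of each length, per transmitter
      count(Transmitter, n_timediff) %>%
      # One column per run length; combinations that never occur become 0
      pivot_wider(names_from = n_timediff, values_from = n,
                  names_prefix = "timediff_", values_fill = 0L)

    print(result)
    ```

    Computing rle() after group_by(Transmitter) means a run can never continue from the last rows of one transmitter into the first rows of the next, and values_fill avoids NA for run lengths that a given transmitter never shows.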