Search code examples
rdplyrdata.tablesnow

counting events by event history with R


I have a data table that is structured like this, where I have kept track of processes. If an event occured then I marked a 1 next to it in that day, otherwise 0. I have shown the first few events here, but the real dataset has many many rows (over 500,000), with many unique process id's.

process_id    date         event
00001       01/01/12     0
00002       01/01/12     1
00003       01/01/12     0
...         ...          ...
00001       01/01/19     1
00002       01/01/19     0
00003       01/01/19     1

What I would like to know now is for each observation (row) if an event had occurred in the last year(not including the current date) for that process_id and add a column denoting the flag. Suppose that the row

00002       10/01/18     1

occured in the table, then an output table might look like

process_id     date         event    previousEvent     
00001          01/01/12     0        NA
00002          01/01/12     1        NA
00003          01/01/12     0        NA
...            ...          ...      ...
00001          01/01/19     1        0
00002          01/01/19     0        1
00003          01/01/19     1        0

My current way of doing this is by filtering using the dplyr toolkit, however I assume that since it is not a vectorized approach, that it may not be the most efficient way of doing things. Using the doSNOW package for a parallelized approach, the main loop of the program looks like the following. It simply counts how many times the event occured to determine if the event happened in the last year or not. However, even this approach takes a very long time (about an hour for this many rows on my machine)

result <- foreach(i = 1:nrow(data),
              .options.snow=opts, .combine='rbind', .packages = 'dplyr') 
 %dopar%
{
  d <- nrow(data%>%
      filter(process_id %in% data[i,]$process_id ) %>%
      filter(date>= data[i,]$LastYearDate) %>%
      filter(date< data[i,]$date) %>%
      filter(event > 0))
  return(ifelse(d,1,0))
}

Could there be a better approach? I am pretty new with R and the many techniques to filter tables.


Solution

  • You could combine this idiom with a non-equi join:

    library(data.table)
    library(lubridate)
    
    df <- read.table(header=T, text="
    process_id    date         event
    00001       00/01/20     1
    00002       00/01/20     1
    00003       00/01/20     0
    00001       01/01/19     1
    00002       01/01/19     0
    00003       01/01/19     1")
    
    dt <- as.data.table(df)
    
    dt[, date := as.POSIXct(date, format = "%y/%m/%d")]
    dt[, prev_year := date - lubridate::dyears(1L)]
    
    positives <- dt[.(1), .(process_id, date, event), on = "event"]
    
    dt[, prev_event := positives[.SD,
                                 .(x.event),
                                 on = .(process_id, date < date, date >= prev_year),
                                 mult = "last"]]
    
    print(dt)
       process_id       date event  prev_year prev_event
    1:          1 2000-01-20     1 1999-01-20         NA
    2:          2 2000-01-20     1 1999-01-20         NA
    3:          3 2000-01-20     0 1999-01-20         NA
    4:          1 2001-01-19     1 2000-01-20          1
    5:          2 2001-01-19     0 2000-01-20          1
    6:          3 2001-01-19     1 2000-01-20         NA
    

    Adjust the date format if necessary, and remove prev_year afterwards if you don't need it.

    And if you'd like to add also the date at which the previous event occurred, change the line before print to:

    dt[, `:=`(
      c("prev_event", "prev_date"),
      positives[.SD, .(x.event, x.date), on = .(process_id, date < date, date >= prev_year), mult = "last"]
    )]
    

    A bit of a shameless plug: with the new version of table.express, you could also write the above as:

    library(table.express)
    library(data.table)
    library(lubridate)
    
    dt <- as.data.table(df) %>%
      start_expr %>%
      mutate(date = as.POSIXct(date, format = "%y/%m/%d")) %>%
      mutate(prev_year = date - lubridate::dyears(1L)) %>%
      end_expr
    
    positives <- dt %>%
      start_expr %>%
      filter_on(event = 1) %>%
      select(process_id, date, event) %>%
      end_expr
    
    dt %>%
      start_expr %>%
      mutate_join(positives,
                  process_id, date > date, prev_year <= date,
                  mult = "last",
                  .SDcols = c(prev_event = "event", prev_date = "date")) %>%
      end_expr
    
    print(dt)
       process_id       date event  prev_year prev_event  prev_date
    1:          1 2000-01-20     1 1999-01-20         NA       <NA>
    2:          2 2000-01-20     1 1999-01-20         NA       <NA>
    3:          3 2000-01-20     0 1999-01-20         NA       <NA>
    4:          1 2001-01-19     1 2000-01-20          1 2000-01-20
    5:          2 2001-01-19     0 2000-01-20          1 2000-01-20
    6:          3 2001-01-19     1 2000-01-20         NA       <NA>