Search code examples
rdplyrtidyrsummarize

Extracting multiple rows for each ID based on a condition


I have a data frame with thousands of rows but a sample is given below:

     userid     event
1     123        view
2     123        view
3     123       order
4     345        view
5     345        view
6     345        view
7     345       order
8     111        view
9     111       order
10    111        view
11    111        view
12    111        view
13    333        view
14    333        view
15    333        view

dput(data)

structure(list(userid = c(123, 123, 123, 345, 345, 345, 345, 
111, 111, 111, 111, 111, 333, 333, 333), eventaction = structure(c(2L, 
2L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("order", 
"view"), class = "factor")), .Names = c("userid", "event"
), row.names = c(NA, -15L), class = "data.frame")

What I am looking at doing is to extract all the rows of userid's that contains the word "order" under event. The result will contain all the rows for userid's excluding userid = 333 as the eventaction does not contain an order entry.

The second task is to count the number of occurences for "view" before the order entry. I will appreciate help and pointers.

Thanks.


Solution

  • We can try with data.table. Convert the 'data.frame' to 'data.table' (setDT(data)), grouped by 'userid', if there are any 'event' that is an 'order' in the 'userid', return the Subset of Data.table' (.SD)

    library(data.table)
    setDT(data)[,if(any(event=="order")) .SD , by = userid]
    

    Or using dplyr, we filter for any 'order' in the 'event' after grouping by 'userid'.

    library(dplyr)
    data %>%
        group_by(userid) %>%
        filter(any(event == "order"))