Search code examples
rgroup-bydplyrcase-when

How do I create an index variable for unique values of X within a group Y?


I have the following table:

id_question  id_event   num_events
2015012713    49508          1
2015012711    49708          1
2015011523    41808          3
2015011523    44008          3
2015011523    44108          3
2015011522    41508          3
2015011522    43608          3
2015011522    43708          3
2015011521    39708          1
2015011519    44208          1

The third column gives the count of events by question. I want to create a variable that would index the events by question only where there are multiple events per question. It would look something like that:

id_question  id_event   num_events  index_event
2015012713    49508          1          
2015012711    49708          1          
2015011523    41808          3          1
2015011523    44008          3          2
2015011523    44108          3          3
2015011522    41508          3          1
2015011522    43608          3          2
2015011522    43708          3          3
2015011521    39708          1          
2015011519    44208          1          

How can I do that?


Solution

  • We can use tidyverse to create an 'index_event' after grouping by 'id_question'. If the number of rows are greater than 1 (n() >1), then get the sequence of rows (row_number()) and the default option in case_when is NA

    library(dplyr)
    df1 %>%
       group_by(id_question) %>%
       mutate(index_event = case_when(n() >1 ~ row_number()))
    # A tibble: 10 x 4
    # Groups:   id_question [6]
    #   id_question id_event num_events index_event
    #         <int>    <int>      <int>       <int>
    # 1  2015012713    49508          1          NA
    # 2  2015012711    49708          1          NA
    # 3  2015011523    41808          3           1
    # 4  2015011523    44008          3           2
    # 5  2015011523    44108          3           3
    # 6  2015011522    41508          3           1
    # 7  2015011522    43608          3           2
    # 8  2015011522    43708          3           3
    # 9  2015011521    39708          1          NA
    #10  2015011519    44208          1          NA
    

    Or with data.table, we use rowid on 'id_question' and change the elements that are 1 in 'num_events' to NA with NA^ (making use of NA^0, NA^1)

    library(data.table)
    setDT(df1)[, index_event := rowid(id_question) * NA^(num_events == 1)]
    

    Or using base R, another option with the sequence of frequency from 'id_question' and change elements to NA as in the previous case

    df1$index_event <-  with(df1, sequence(table(id_question)) * NA^(num_events == 1))
    df1$index_event
    #[1] NA NA  1  2  3  1  2  3 NA NA
    

    data

    df1 <- structure(list(id_question = c(2015012713L, 2015012711L, 2015011523L, 
    2015011523L, 2015011523L, 2015011522L, 2015011522L, 2015011522L, 
    2015011521L, 2015011519L), id_event = c(49508L, 49708L, 41808L, 
    44008L, 44108L, 41508L, 43608L, 43708L, 39708L, 44208L), num_events = c(1L, 
    1L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 1L)), class = "data.frame", row.names = c(NA, 
    -10L))