
Detect first occurrences between two variables in R


I would like to flag the first occurrence of each combination of two variables (IPC and 2IPC) in R, leaving out cases in which the two variables are the same (i.e. IPC == 2IPC).

Here is an example dataset:

**date  IPC     2IPC    occurrence** 
 1968   G01S    Na      1
 1969   G01N    G01S    1
 1969   B62D    B43L    1
 1969   G01S    Na      0
 1970   G01S    G01C    1
 1970   G01S    H04B    1
 1970   G01S    H04B    0
 1971   G01S    H01S    1
 1971   G01S    G01S    0
 1972   H04N    H04N    0
 1972   G01S    G01S    0
 1972   G01S    G01S    0

I used the Excel function COUNTIFS, which creates a dummy variable (occurrence) marking the first occurrence of each pair of values in the two variables. Is it possible to use dplyr for this task?


Solution

  • Using dplyr and assuming that Na values are valid values and not NAs, you may run the following code:

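    library(dplyr)
    mydf %>%
      # one group per (IPC, X2IPC) pair
      group_by(IPC, X2IPC) %>%
      # running count of each pair, in row order
      mutate(N_occurences = row_number()) %>%
      # 1 only for the first row of a pair whose two codes differ
      mutate(FirstOccurrence = case_when(
        (IPC != X2IPC) & N_occurences == 1 ~ 1,
        (IPC == X2IPC) | N_occurences != 1 ~ 0
      ))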
    

    You'll get the following result:

       X..date IPC   X2IPC occurrence.. N_occurences FirstOccurrence
         <int> <chr> <chr>        <int>        <int>           <dbl>
     1    1968 G01S  Na               1            1            1.00
     2    1969 G01N  G01S             1            1            1.00
     3    1969 B62D  B43L             1            1            1.00
     4    1969 G01S  Na               0            2            0   
     5    1970 G01S  G01C             1            1            1.00
     6    1970 G01S  H04B             1            1            1.00
     7    1970 G01S  H04B             0            2            0   
     8    1971 G01S  H01S             1            1            1.00
     9    1971 G01S  G01S             0            1            0   
    10    1972 H04N  H04N             0            1            0   
    11    1972 G01S  G01S             0            2            0   
    12    1972 G01S  G01S             0            3            0
    

    If you want the same columns as the data frame in your original post, just run:

    mydf %>% 
        group_by(IPC,X2IPC) %>%
        mutate(N_occurences=row_number()) %>% 
        mutate(FirstOccurrence=case_when(
            (IPC!=X2IPC) & N_occurences==1 ~ 1,
            (IPC==X2IPC) | N_occurences!=1 ~ 0
        )) %>%
        select(1:3,6)
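
For anyone who wants to reproduce this end to end, here is a minimal self-contained sketch. It rebuilds the sample data from the question and collapses the two `mutate()` steps into one logical test. The column names `date`, `X2IPC` and `occurrence`, and the object names `result` and `base_check`, are assumptions made for this illustration, not names taken from the original data file. The base R line using `duplicated()` is only a cross-check, not part of the answer above.

```r
library(dplyr)

# Sample data copied from the question. Column names are assumed;
# "Na" is kept as an ordinary string, as the answer assumes.
mydf <- data.frame(
  date  = c(1968, 1969, 1969, 1969, 1970, 1970, 1970, 1971, 1971, 1972, 1972, 1972),
  IPC   = c("G01S", "G01N", "B62D", "G01S", "G01S", "G01S",
            "G01S", "G01S", "G01S", "H04N", "G01S", "G01S"),
  X2IPC = c("Na", "G01S", "B43L", "Na", "G01C", "H04B",
            "H04B", "H01S", "G01S", "H04N", "G01S", "G01S"),
  occurrence = c(1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0),
  stringsAsFactors = FALSE
)

# Flag the first row of each (IPC, X2IPC) pair, but only when the two codes differ.
result <- mydf %>%
  group_by(IPC, X2IPC) %>%
  mutate(FirstOccurrence = as.integer(row_number() == 1 & IPC != X2IPC)) %>%
  ungroup()

# Base R cross-check: a row is a first occurrence if its pair has not been seen before.
base_check <- as.integer((!duplicated(mydf[c("IPC", "X2IPC")])) & (mydf$IPC != mydf$X2IPC))
```

On this sample, `result$FirstOccurrence` should agree with the `occurrence` column built in Excel. Note that if the `Na` entries were real missing values (`NA`) rather than strings, the comparison `IPC != X2IPC` would return `NA` for those rows, so they would need separate handling.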