Search code examples
rduplicatesconditional-statementsassign

Create ID variable: if ≥1 column duplicate then mark as duplicate


Ive seen many questions about creating a new ID variable, based on multiple columns conditions. However it is usually if var1 AND var2 are double, then mark as duplicate number.

My question is how do you create a new variable ID and mark for duplicates if

  1. var1 is duplicate, OR
  2. var2 is duplicate, OR
  3. var3 is duplicate.

Example dataset (EDITED):

      pat         var1   var2       var3     
 1     1          1      10           1       
 2     2          16     10          11       
 3     3          21     27           2       
 4     4          22     29           2       
 5     5          31     35           3       
 6     6          44     47           4       
 7     7          5      50           5       
 8     8          6      60           6       
 9     9          7      70           7       
10    10          8      80           7       
11    11          9      90           8       
12    12         10      11           9       
13    13         11      13           91      
14    14         11      14           10     
15    15         NA      15           15
16    16         NA      15           16
17    17         12      NA           17
18    18         13      NA           18

 sample <- data.frame(pat = c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18), 
                     var1 = c(1,16,21,22,31,44,5,6,7,8,9,10,11,11, NA,NA,12,13), 
var2 = c(10,10,27,29,35,47,50,60,70,80,90,11,13,14,15,15,NA,NA),
                     var3 = c(1,11,2,2,3,4,5,6,7,7,8,9,91,10,15,16,17,18)

So if one of the three var variables is duplicated, then the new ID variable should show a duplicate ID number.

Desired output (EDITED):

      pat         var1   var2       var3     ID
 1     1          1      10           1       1
 2     2          16     10          11       1
 3     3          21     27           2       2
 4     4          22     29           2       2
 5     5          31     35           3       3
 6     6          44     47           4       4
 7     7          5      50           5       5
 8     8          6      60           6       6
 9     9          7      70           7       7
10    10          8      80           7       7
11    11          9      90           8       8
12    12         10      11           9       9
13    13         11      13           91      10
14    14         11      14           10      10
15    15         NA      15           15      11
16    16         NA      15           16      11
17    17         12      NA           17      12
18    18         13      NA           18      13

I couldnt find a question based on similar conditions therefor im asking it. Many thanks in advance.

EDIT The answer of Ben works perfect if there are no NA values present. Unfortunately I did not mention I also had NA values present in for var1,2 or 3. A NA value meant that idnumber for Var1/2/3 was missing. So ive adjusted the question a bit and added some NA values. The added question is:

Is it possible for a script to judge: if var1=c(NA,NA), var2=(1,1) and var3=(1,2) to report a duplicate but if var1=c(NA,NA), var2=c(1,2) and var3=(1,2) to report a unique number?


Solution

  • Maybe you could try the following. Here we use tail and head to refer to rows 2 through 14 compared to 1 through 13 (effectively comparing each row with the prior row).

    We can use rowSums of differences between each row and the previous row. If the difference is zero, then the result is TRUE (or 1), and the ID would increase for each value of 1 from row to row. These are cumulatively summed with cumsum.

    The use of c will make the first ID 1. Also, the cumsum is adjusted by 1 to account for the initial ID of 1.

    sample$ID <- 
      c(1, cumsum(rowSums(tail(sample[-1], -1) == head(sample[-1], -1)) == 0) + 1)
    
    sample
    

    Output

       pat var1 var2 var3 ID
    1    1    1   10    1  1
    2    2   16   10   11  1
    3    3   21   27    2  2
    4    4   22   29    2  2
    5    5   31   35    3  3
    6    6   44   47    4  4
    7    7    5   50    5  5
    8    8    6   60    6  6
    9    9    7   70    7  7
    10  10    8   80    7  7
    11  11    9   90    8  8
    12  12   10   11    9  9
    13  13   11   13   91 10
    14  14   11   14   10 10
    

    Edit: Based on comment below, there are occasions where the value is NA which should be ignored. In the example above, NA repeated (such as var2 in rows 17-18) does not count as a duplicate.

    Here is another approach. You can use sapply to go through the rows numbers of your data.frame.

    You can use mapply to subtract each var from the row next to a given row, and check if any have a value of zero. Note that na.rm = T will ignore missing NA values.

    sample$ID <-
      c(1,
        cumsum(
          sapply(
            seq_len(nrow(sample)-1),
            \(x) {
              !any(mapply(`-`, sample[x, -1, drop = T], sample[x + 1, -1, drop = T]) == 0, na.rm = T)
            }
          )
        ) + 1
      )
    

    Output

       pat var1 var2 var3 ID
    1    1    1   10    1  1
    2    2   16   10   11  1
    3    3   21   27    2  2
    4    4   22   29    2  2
    5    5   31   35    3  3
    6    6   44   47    4  4
    7    7    5   50    5  5
    8    8    6   60    6  6
    9    9    7   70    7  7
    10  10    8   80    7  7
    11  11    9   90    8  8
    12  12   10   11    9  9
    13  13   11   13   91 10
    14  14   11   14   10 10
    15  15   NA   15   15 11
    16  16   NA   15   16 11
    17  17   12   NA   17 12
    18  18   13   NA   18 13