Ive seen many questions about creating a new ID variable, based on multiple columns conditions. However it is usually if var1 AND var2 are double, then mark as duplicate number.
My question is how do you create a new variable ID and mark for duplicates if
Example dataset (EDITED):
pat var1 var2 var3
1 1 1 10 1
2 2 16 10 11
3 3 21 27 2
4 4 22 29 2
5 5 31 35 3
6 6 44 47 4
7 7 5 50 5
8 8 6 60 6
9 9 7 70 7
10 10 8 80 7
11 11 9 90 8
12 12 10 11 9
13 13 11 13 91
14 14 11 14 10
15 15 NA 15 15
16 16 NA 15 16
17 17 12 NA 17
18 18 13 NA 18
sample <- data.frame(pat = c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18),
var1 = c(1,16,21,22,31,44,5,6,7,8,9,10,11,11, NA,NA,12,13),
var2 = c(10,10,27,29,35,47,50,60,70,80,90,11,13,14,15,15,NA,NA),
var3 = c(1,11,2,2,3,4,5,6,7,7,8,9,91,10,15,16,17,18)
So if one of the three var variables is duplicated, then the new ID variable should show a duplicate ID number.
Desired output (EDITED):
pat var1 var2 var3 ID
1 1 1 10 1 1
2 2 16 10 11 1
3 3 21 27 2 2
4 4 22 29 2 2
5 5 31 35 3 3
6 6 44 47 4 4
7 7 5 50 5 5
8 8 6 60 6 6
9 9 7 70 7 7
10 10 8 80 7 7
11 11 9 90 8 8
12 12 10 11 9 9
13 13 11 13 91 10
14 14 11 14 10 10
15 15 NA 15 15 11
16 16 NA 15 16 11
17 17 12 NA 17 12
18 18 13 NA 18 13
I couldnt find a question based on similar conditions therefor im asking it. Many thanks in advance.
EDIT The answer of Ben works perfect if there are no NA values present. Unfortunately I did not mention I also had NA values present in for var1,2 or 3. A NA value meant that idnumber for Var1/2/3 was missing. So ive adjusted the question a bit and added some NA values. The added question is:
Is it possible for a script to judge: if var1=c(NA,NA), var2=(1,1) and var3=(1,2) to report a duplicate but if var1=c(NA,NA), var2=c(1,2) and var3=(1,2) to report a unique number?
Maybe you could try the following. Here we use tail
and head
to refer to rows 2 through 14 compared to 1 through 13 (effectively comparing each row with the prior row).
We can use rowSums
of differences between each row and the previous row. If the difference is zero, then the result is TRUE
(or 1), and the ID
would increase for each value of 1 from row to row. These are cumulatively summed with cumsum
.
The use of c
will make the first ID
1. Also, the cumsum
is adjusted by 1 to account for the initial ID
of 1.
sample$ID <-
c(1, cumsum(rowSums(tail(sample[-1], -1) == head(sample[-1], -1)) == 0) + 1)
sample
Output
pat var1 var2 var3 ID
1 1 1 10 1 1
2 2 16 10 11 1
3 3 21 27 2 2
4 4 22 29 2 2
5 5 31 35 3 3
6 6 44 47 4 4
7 7 5 50 5 5
8 8 6 60 6 6
9 9 7 70 7 7
10 10 8 80 7 7
11 11 9 90 8 8
12 12 10 11 9 9
13 13 11 13 91 10
14 14 11 14 10 10
Edit: Based on comment below, there are occasions where the value is NA
which should be ignored. In the example above, NA
repeated (such as var2
in rows 17-18) does not count as a duplicate.
Here is another approach. You can use sapply
to go through the rows numbers of your data.frame.
You can use mapply
to subtract each var from the row next to a given row, and check if any
have a value of zero. Note that na.rm = T
will ignore missing NA
values.
sample$ID <-
c(1,
cumsum(
sapply(
seq_len(nrow(sample)-1),
\(x) {
!any(mapply(`-`, sample[x, -1, drop = T], sample[x + 1, -1, drop = T]) == 0, na.rm = T)
}
)
) + 1
)
Output
pat var1 var2 var3 ID
1 1 1 10 1 1
2 2 16 10 11 1
3 3 21 27 2 2
4 4 22 29 2 2
5 5 31 35 3 3
6 6 44 47 4 4
7 7 5 50 5 5
8 8 6 60 6 6
9 9 7 70 7 7
10 10 8 80 7 7
11 11 9 90 8 8
12 12 10 11 9 9
13 13 11 13 91 10
14 14 11 14 10 10
15 15 NA 15 15 11
16 16 NA 15 16 11
17 17 12 NA 17 12
18 18 13 NA 18 13