I have a small data frame in which I want to identify possible duplicate values in the second column, 'Real_sample'. However, this column also contains blank values, and the duplicated() function flags these blanks as duplicates too.
This is my dataframe:
structure(list(Deelnemernr. = c("4781061T0", "4781074T0", "4781076T0",
"4781087T0", "4781047T0", "4781027T0", "4781024T0", "4781022T0",
"4781017T0", "4781023T0", "4781078T0", "4781079T0", "4781063T0",
"4781008T0", "4781026T0", "4781014T0", "4781015T0", "4781072T0",
"4781046T0", "4781032T0", "4781077T0", "4781031T0", "4781030T0",
"4781052T0", "4781056T0", "4781004T0", "4781043T0", "4781090T0",
"4781036T0", "4781028T0", "4781053T0", "4781068T0", "4781048T0",
"4781041T0", "4781050T0", "4781542T0", "4781095T0", "4781020T0",
"4781097T0", "4781182T0"), Real_sample = c("4781061T0", "4781074T0",
"4781076T0", "4781087T0", "4781047T0", "4781027T0", "4781024T0",
"4781022T0", "4781017T0", "4781023T0", "4781078T0", "4781079T0",
"4781063T0", "4781008T0", "4781026T0", "4781014T0", "", "4781072T0",
"4781061T0", "4781032T0", "4781077T0", "4781031T0", "4781030T0",
"", "4781056T0", "4781004T0", "4781043T0", "4781090T0", "4781036T0",
"4781028T0", "4781053T0", "4781068T0", "4781048T0", "4781041T0",
"4781050T0", "4781542T0", "4781095T0", "4781020T0", "4781097T0",
"4781182T0")), row.names = c(NA, -40L), class = c("tbl_df", "tbl",
"data.frame"))
Can anyone help me to ONLY identify the duplicated values and ignore the blanks?
Thanks
duplicated() also lets you define incomparable values through its incomparables argument; this way you can exclude "" from comparisons:
df$dup <- duplicated(df$Real_sample, incomparables = "")
df[15:25, ]
#>    Deelnemernr. Real_sample   dup
#> 15    4781026T0   4781026T0 FALSE
#> 16    4781014T0   4781014T0 FALSE
#> 17    4781015T0             FALSE
#> 18    4781072T0   4781072T0 FALSE
#> 19    4781046T0   4781061T0  TRUE
#> 20    4781032T0   4781032T0 FALSE
#> 21    4781077T0   4781077T0 FALSE
#> 22    4781031T0   4781031T0 FALSE
#> 23    4781030T0   4781030T0 FALSE
#> 24    4781052T0             FALSE
#> 25    4781056T0   4781056T0 FALSE
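Note that duplicated() only marks the second and later occurrences, so in the output above row 19 is flagged but the first "4781061T0" in row 1 is not. If you also want the first occurrence marked, one possible approach is to combine a forward and a backward pass. The sketch below uses a small made-up data frame (the id column and the values "A1", "B2", "C3" are placeholders, not your data) and the dup_all column name is just an illustration:

```r
# Toy data mirroring the question: blanks ("") must never count as duplicates.
df <- data.frame(
  id = 1:6,
  Real_sample = c("A1", "B2", "", "A1", "", "C3"),
  stringsAsFactors = FALSE
)

# TRUE only for the second and later occurrences of a non-blank value.
df$dup <- duplicated(df$Real_sample, incomparables = "")

# TRUE for every member of a duplicated group, including the first occurrence:
# the fromLast = TRUE pass flags the earlier occurrences that the forward pass skips.
df$dup_all <- df$dup |
  duplicated(df$Real_sample, fromLast = TRUE, incomparables = "")

# Keep only the rows that belong to a duplicated group (rows 1 and 4 here).
dup_rows <- df[df$dup_all, ]
print(dup_rows)
```

With your data, the same two lines applied to df$Real_sample should flag both row 1 and row 19 while leaving the blank rows 17 and 24 as FALSE.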