r dataframe duplicates

Identifying duplicated values in a column but ignoring blank cells


I have a small dataframe in which I want to identify possible duplicated values in the second column, 'Real_sample'. However, this column also contains blank values, and duplicated() flags the repeated blanks as duplicates too.

This is my dataframe:

df <- structure(list(Deelnemernr. = c("4781061T0", "4781074T0", "4781076T0", 
"4781087T0", "4781047T0", "4781027T0", "4781024T0", "4781022T0", 
"4781017T0", "4781023T0", "4781078T0", "4781079T0", "4781063T0", 
"4781008T0", "4781026T0", "4781014T0", "4781015T0", "4781072T0", 
"4781046T0", "4781032T0", "4781077T0", "4781031T0", "4781030T0", 
"4781052T0", "4781056T0", "4781004T0", "4781043T0", "4781090T0", 
"4781036T0", "4781028T0", "4781053T0", "4781068T0", "4781048T0", 
"4781041T0", "4781050T0", "4781542T0", "4781095T0", "4781020T0", 
"4781097T0", "4781182T0"), Real_sample = c("4781061T0", "4781074T0", 
"4781076T0", "4781087T0", "4781047T0", "4781027T0", "4781024T0", 
"4781022T0", "4781017T0", "4781023T0", "4781078T0", "4781079T0", 
"4781063T0", "4781008T0", "4781026T0", "4781014T0", "", "4781072T0", 
"4781061T0", "4781032T0", "4781077T0", "4781031T0", "4781030T0", 
"", "4781056T0", "4781004T0", "4781043T0", "4781090T0", "4781036T0", 
"4781028T0", "4781053T0", "4781068T0", "4781048T0", "4781041T0", 
"4781050T0", "4781542T0", "4781095T0", "4781020T0", "4781097T0", 
"4781182T0")), row.names = c(NA, -40L), class = c("tbl_df", "tbl", 
"data.frame"))

Can anyone help me to ONLY identify the duplicated values and ignore the blanks?

Thanks


Solution

  • duplicated() has an incomparables argument for values that should never be treated as duplicates. Passing "" excludes the blank cells from the comparison:

    df$dup <- duplicated(df$Real_sample, incomparables = "")
    df[15:25, ]
    #>    Deelnemernr. Real_sample   dup
    #> 15    4781026T0   4781026T0 FALSE
    #> 16    4781014T0   4781014T0 FALSE
    #> 17    4781015T0             FALSE
    #> 18    4781072T0   4781072T0 FALSE
    #> 19    4781046T0   4781061T0  TRUE
    #> 20    4781032T0   4781032T0 FALSE
    #> 21    4781077T0   4781077T0 FALSE
    #> 22    4781031T0   4781031T0 FALSE
    #> 23    4781030T0   4781030T0 FALSE
    #> 24    4781052T0             FALSE
    #> 25    4781056T0   4781056T0 FALSE
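
Note that duplicated() only marks the second and later occurrences, so row 1 (the first "4781061T0") is not flagged above. If every row of a duplicated group should be marked, one possible approach is to combine a forward and a backward pass. The sketch below uses a small made-up vector `x` as a stand-in for df$Real_sample; the names `x`, `later` and `all_dups` are illustrative only and not part of the original answer.

```r
# Made-up stand-in for df$Real_sample: one real duplicate ("A1") and two blanks
x <- c("A1", "B2", "", "A1", "", "C3")

# Default behaviour: the second "" is reported as a duplicate of the first
duplicated(x)
#> [1] FALSE FALSE FALSE  TRUE  TRUE FALSE

# With incomparables = "", blanks are never matched against each other
later <- duplicated(x, incomparables = "")
later
#> [1] FALSE FALSE FALSE  TRUE FALSE FALSE

# Combine with a pass from the end to also flag the first occurrence
all_dups <- later | duplicated(x, fromLast = TRUE, incomparables = "")
all_dups
#> [1]  TRUE FALSE FALSE  TRUE FALSE FALSE
```

Applied to the dataframe, the same idea would be df$dup <- duplicated(df$Real_sample, incomparables = "") | duplicated(df$Real_sample, fromLast = TRUE, incomparables = ""), after which df[df$dup, ] should list only the rows that share a non-blank Real_sample value.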