
Removing/collapsing duplicate rows in R


I am using the following R code, which I copied from elsewhere (https://support.bioconductor.org/p/70133/). It seems to work well for what I want to do, which is to remove or collapse duplicates from a dataset, but I do not understand the last line. I would like to know on what basis the duplicates are removed or collapsed. A comment there said it was based on the median absolute deviation (MAD), but I do not follow that. Could anyone help me understand this, please?

 Probesets=paste("a",1:200,sep="")
 Genes=sample(letters,200,replace=T)
 Value=rnorm(200)
 X=data.frame(Probesets,Genes,Value)
 X=X[order(X$Value,decreasing=T),]
 Y=X[which(!duplicated(X$Genes)),]

Solution

  • Are you sure you want to remove those rows where the Genes values are duplicated? That is at least what this code does:

    Y=X[which(!duplicated(X$Genes)),]
    

    Thus, Y contains only unique Genes values. If you compare nrow(Y) and length(unique(X$Genes)), you will see that the result is the same:

    nrow(Y); length(unique(X$Genes))
    [1] 26
    [1] 26
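
    Which of the duplicated rows survives depends on row order. duplicated() flags every occurrence after the first, so once X has been sorted by Value in decreasing order, the row kept for each gene is the one with the largest Value. A minimal sketch to check this, reusing the question's simulated data (the seed and the name top_by_gene are additions for illustration):

    ```r
    set.seed(1)  # assumption: fixed seed so the run is reproducible
    Probesets <- paste("a", 1:200, sep = "")
    Genes <- sample(letters, 200, replace = TRUE)
    Value <- rnorm(200)
    X <- data.frame(Probesets, Genes, Value)

    # Sort so that the largest Value for each gene comes first
    X <- X[order(X$Value, decreasing = TRUE), ]
    # Keep only the first row seen for each gene
    Y <- X[!duplicated(X$Genes), ]

    # Largest Value per gene, computed independently (illustrative name)
    top_by_gene <- tapply(X$Value, X$Genes, max)

    # Check that every kept row carries its gene's maximum Value
    all(Y$Value == top_by_gene[as.character(Y$Genes)])
    ```

    Without the order() step, the same duplicated() call would simply keep whichever row for each gene happens to appear first in the data.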
    

    If you want to remove rows that contain duplicate values across all columns, which is arguably the definition of a duplicate row, then you can do this:

    Y=X[!duplicated(X),]
    

    To see how it works consider this example:

    df <- data.frame(
      a = c(1,1,2,3),
      b = c(1,1,3,4)
    )
    df
      a b
    1 1 1
    2 1 1
    3 2 3
    4 3 4
    
    df[!duplicated(df),]
      a b
    1 1 1
    3 2 3
    4 3 4
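
On the MAD comment in the question: it most likely means that, on real data, the column used for sorting holds each probeset's median absolute deviation across samples rather than random numbers. That is an assumption about the linked thread, not something the code above shows. A minimal sketch under that assumption, with a made-up expression matrix expr (20 probesets by 6 samples) and made-up gene labels:

```r
set.seed(42)  # assumption: simulated data, for illustration only
expr <- matrix(rnorm(20 * 6), nrow = 20,
               dimnames = list(paste0("a", 1:20), paste0("s", 1:6)))
genes <- sample(letters[1:5], 20, replace = TRUE)

# Median absolute deviation of each probeset across the samples
probe_mad <- apply(expr, 1, mad)

X <- data.frame(Probesets = rownames(expr), Genes = genes, MAD = probe_mad)
# Most variable probeset first, then keep the first row per gene
X <- X[order(X$MAD, decreasing = TRUE), ]
Y <- X[!duplicated(X$Genes), ]
Y
```

Each gene is then represented by its most variable probeset, which is one common way of collapsing several probesets to a single row per gene.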