Selecting between duplicate data in a data frame


Earlier I asked a question about extracting duplicate lines from a data frame. I now need to run a script to decide which of these duplicates to keep in my final data set.

Duplicate entries in this data set have the same 'Assay' and 'Sample' values. Here are the first 10 lines of the new data set I'm working with, which contains my duplicate entries:

     Assay   Sample    Genotype   Data
1  CCT6-002   1486         A        1
2  CCT6-002   1486         G        0
3  CCT6-002   1997         G        0
4  CCT6-002   1997         NA       NA
5  CCT6-002   0050         G        0
6  CCT6-002   0050         G        0
7  CCT6-015   0082         G        0
8  CCT6-015   0082         T        1
9  CCT6-015   0121         G        0
10 CCT6-015   0121         NA       NA

I'd like to run a script that will break these duplicate samples into 4 bins based on the values of 'Data', which can be 1, 0, or NA:

 1) All values for 'Data' are NA
 2) All values for 'Data' are identical, no NA
 3) Values for 'Data' are not all identical, no NA.
 4) Values for 'Data' are not all identical, at least one is NA.

The expected result from the above data would look like this:

Set 1
Null

Set 2
5  CCT6-002   0050         G        0
6  CCT6-002   0050         G        0

Set 3
1  CCT6-002   1486         A        1
2  CCT6-002   1486         G        0
7  CCT6-015   0082         G        0
8  CCT6-015   0082         T        1

Set 4
3  CCT6-002   1997         G        0
4  CCT6-002   1997         NA       NA
9  CCT6-015   0121         G        0
10 CCT6-015   0121         NA       NA

There are cases in which more than 2 "duplicate" data points exist in this data set. I'm not even sure where to start with this, as I'm a newbie to R.

EDIT: With expected data.


Solution

  • This should be a good start. Depending on how long your dataset is, it may or may not be worth it to optimize this for better speed.

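    library(plyr)
    
    # Read data. The header row has one field fewer than the data rows, so
    # read.table treats the first column as row names. colClasses covers all
    # five columns; 'character' keeps leading zeros in Sample (e.g. "0050").
    data <- read.table('data.txt', header=TRUE,
                       colClasses=c(NA, NA, 'character', NA, NA))
    
    # Function to pick the set for one Assay/Sample group
    pickSet <- function(x) {
      if(all(is.na(x$Data))) {
        set <- 1                                 # all values are NA
      } else if(length(unique(x$Data)) == 1) {
        set <- 2                                 # all identical, no NA
      } else if(!any(is.na(x$Data))) {
        set <- 3                                 # values differ, no NA
      } else {
        set <- 4                                 # mix of NA and non-NA values
      }
      data.frame(Set=set)
    }
    
    # Identify Set for each combo of Assay and Sample
    sets <- ddply(data, c('Assay', 'Sample'), pickSet)
    
    # Merge set info back with data
    data <- join(data, sets, by=c('Assay', 'Sample'))
    
    # Reformat to list, dropping the helper Set column
    sets.list <- lapply(1:4, function(x) data[data$Set==x, names(data) != 'Set'])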
    
    > sets.list
    [[1]]
    [1] Assay    Sample   Genotype Data    
    <0 rows> (or 0-length row.names)
    
    [[2]]
         Assay Sample Genotype Data
    5 CCT6-002   0050        G    0
    6 CCT6-002   0050        G    0
    
    [[3]]
         Assay Sample Genotype Data
    1 CCT6-002   1486        A    1
    2 CCT6-002   1486        G    0
    7 CCT6-015   0082        G    0
    8 CCT6-015   0082        T    1
    
    [[4]]
          Assay Sample Genotype Data
    3  CCT6-002   1997        G    0
    4  CCT6-002   1997     <NA>   NA
    9  CCT6-015   0121        G    0
    10 CCT6-015   0121     <NA>   NA
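
If you would rather not depend on plyr, the same binning can be sketched in base R. This is only a minimal sketch, not the method used above: the data frame `dat` is typed in by hand from the question's example rows, and `classify_bin` is a hypothetical helper name introduced here. It works on the 'Data' values of one Assay/Sample group, so groups with more than 2 duplicates are handled the same way as pairs.

```r
# Example rows copied from the question; Sample is kept as character
# so that leading zeros ("0050") survive.
dat <- data.frame(
  Assay    = rep(c("CCT6-002", "CCT6-015"), times = c(6, 4)),
  Sample   = rep(c("1486", "1997", "0050", "0082", "0121"), each = 2),
  Genotype = c("A", "G", "G", NA, "G", "G", "G", "T", "G", NA),
  Data     = c(1, 0, 0, NA, 0, 0, 0, 1, 0, NA),
  stringsAsFactors = FALSE
)

# classify_bin is a hypothetical helper: it maps the 'Data' values of one
# Assay/Sample group to a bin number from 1 to 4, for groups of any size.
classify_bin <- function(v) {
  if (all(is.na(v))) return(1L)             # every value is NA
  if (anyNA(v)) return(4L)                  # some, but not all, values are NA
  if (length(unique(v)) == 1L) return(2L)   # all identical, no NA
  3L                                        # values differ, no NA
}

# ave() applies classify_bin within each Assay/Sample group and repeats
# the group's bin number on every row of that group.
dat$Set <- ave(dat$Data, dat$Assay, dat$Sample, FUN = classify_bin)

# One data frame per bin; factor() with fixed levels keeps empty bins
# (such as Set 1 here) in the resulting list.
sets.list <- split(dat[, c("Assay", "Sample", "Genotype", "Data")],
                   factor(dat$Set, levels = 1:4))
```

Here `sets.list` is a list named "1" to "4", so `sets.list[["3"]]` returns the rows whose 'Data' values differ with no NA. Fixing the factor levels is a design choice: without it, `split` would silently omit any bin that has no rows.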