Earlier I asked a question about extracting duplicate lines from a data frame. I now need to run a script to decide which of these duplicates to keep in my final data set.
Duplicate entries in this data set have the same 'Assay' and 'Sample' values. Here are the first 10 lines of the new data set I'm working with, which contains my duplicate entries:
Assay Sample Genotype Data
1 CCT6-002 1486 A 1
2 CCT6-002 1486 G 0
3 CCT6-002 1997 G 0
4 CCT6-002 1997 NA NA
5 CCT6-002 0050 G 0
6 CCT6-002 0050 G 0
7 CCT6-015 0082 G 0
8 CCT6-015 0082 T 1
9 CCT6-015 0121 G 0
10 CCT6-015 0121 NA NA
I'd like to run a script that will break these duplicate samples into 4 bins based on the value for 'Data', which can be either 1, 0, or NA:
1) All values for 'Data' are NA
2) All values for 'Data' are identical, no NA
3) At least one value for 'Data' differs from the others, none are NA
4) At least one value for 'Data' differs from the others, at least one is NA
The expected result from the above data would look like this:
Set 1
Null
Set 2
5 CCT6-002 0050 G 0
6 CCT6-002 0050 G 0
Set 3
1 CCT6-002 1486 A 1
2 CCT6-002 1486 G 0
7 CCT6-015 0082 G 0
8 CCT6-015 0082 T 1
Set 4
3 CCT6-002 1997 G 0
4 CCT6-002 1997 NA NA
9 CCT6-015 0121 G 0
10 CCT6-015 0121 NA NA
There are cases in which more than 2 "duplicate" data points exist in this data set. I'm not even sure where to start with this, as I'm a newbie to R.
EDIT: With expected data.
This should be a good start. Depending on the size of your dataset, it may or may not be worth optimizing this for speed.
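library(plyr)

# Read data; the five colClasses entries cover the row-name column plus the
# four variables, and 'character' keeps the leading zeros in Sample (e.g. 0050)
data <- read.table('data.txt', colClasses = c(NA, NA, 'character', NA, NA))

# Function to pick the set for one Assay/Sample group
pickSet <- function(x) {
  if (all(is.na(x$Data))) {
    set <- 1   # every value is NA
  } else if (length(unique(x$Data)) == 1) {
    set <- 2   # one distinct value and no NA
  } else if (!any(is.na(x$Data))) {
    set <- 3   # values differ, no NA
  } else {
    set <- 4   # at least one NA alongside non-NA values
  }
  data.frame(Set = set)
}

# Identify Set for each combo of Assay and Sample
sets <- ddply(data, c('Assay', 'Sample'), pickSet)

# Merge set info back with data
data <- join(data, sets, by = c('Assay', 'Sample'))

# Reformat to list, dropping the Set column (column 5)
sets.list <- lapply(1:4, function(x) data[data$Set == x, -5])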
> sets.list
[[1]]
[1] Assay Sample Genotype Data
<0 rows> (or 0-length row.names)
[[2]]
Assay Sample Genotype Data
5 CCT6-002 0050 G 0
6 CCT6-002 0050 G 0
[[3]]
Assay Sample Genotype Data
1 CCT6-002 1486 A 1
2 CCT6-002 1486 G 0
7 CCT6-015 0082 G 0
8 CCT6-015 0082 T 1
[[4]]
Assay Sample Genotype Data
3 CCT6-002 1997 G 0
4 CCT6-002 1997 <NA> NA
9 CCT6-015 0121 G 0
10 CCT6-015 0121 <NA> NA
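If you would rather avoid the plyr dependency, or the grouping step turns out to be slow, the same binning can probably be done in base R with ave() and split(). This is only a sketch under a few assumptions: the data frame is rebuilt inline from the ten rows in the question rather than read from a file, and the names dat, pick_bin and sets.base are placeholders I made up for illustration.

```r
# Example rows from the question, built inline so the sketch is self-contained
dat <- data.frame(
  Assay    = rep(c("CCT6-002", "CCT6-015"), times = c(6, 4)),
  Sample   = rep(c("1486", "1997", "0050", "0082", "0121"), each = 2),
  Genotype = c("A", "G", "G", NA, "G", "G", "G", "T", "G", NA),
  Data     = c(1, 0, 0, NA, 0, 0, 0, 1, 0, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: map the Data values of one group to a bin number (1-4)
pick_bin <- function(v) {
  if (all(is.na(v))) return(1)          # every value is NA
  if (any(is.na(v))) return(4)          # some NA mixed with real values
  if (length(unique(v)) == 1) return(2) # all identical, no NA
  3                                     # values differ, no NA
}

# ave() applies the helper within each Assay/Sample group and writes the
# result back in the original row order
dat$Set <- ave(dat$Data, dat$Assay, dat$Sample,
               FUN = function(v) rep(pick_bin(v), length(v)))

# Split into a list of four data frames; fixing the factor levels at 1:4
# keeps the empty bin 1 in the result
sets.base <- split(dat[, c("Assay", "Sample", "Genotype", "Data")],
                   factor(dat$Set, levels = 1:4))

# The helper does not assume pairs, so groups of three or more also work
pick_bin(c(0, 0, NA))    # bin 4
pick_bin(c(1, 1, 1))     # bin 2
```

sets.base is a named list ("1" to "4"), so sets.base[["3"]] holds the rows whose values disagree without any NA. The order of the checks differs from pickSet above (the NA test comes before the uniqueness test), but both versions should assign the same bins.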