I have a data frame called mydf, approximately 446664 x 234 (dput output is provided below). It has the columns REF and ALT.
REF has exactly one letter in every row, but ALT can have one, two or even three letters separated by commas (","). The remaining columns (the sample columns) are the ones I need to work on.
Treating the letter in REF as 0, the first letter in ALT as 1, the second as 2 and the third as 3, I need a function that does the following:
Replace the numbers in all sample columns (i.e. every column except REF and ALT) with the corresponding letters, and where a cell is "./.", fill it with NA/NA.
Collapse the "/" so that every cell holds a pair of letters.
Finally, I need to transpose the sample columns so that each sample becomes a row, as shown in the expected output. Thank you!
mydf<-
structure(list(REF = structure(c(1L, 4L, 3L, 2L, 3L), .Label = c("A",
"C", "G", "T"), class = "factor"), ALT = structure(c(6L, 6L,
1L, 9L, 1L), .Label = c("A", "A,C", "A,G", "A,T", "C", "C,G",
"C,T", "G", "G,T", "T"), class = "factor"), X860 = structure(c(1L,
3L, 2L, 1L, 1L), .Label = c("./.", "0/0", "0/1", "0/2", "1/1"
), class = "factor"), X861 = structure(c(1L, 6L, 2L, 1L, 1L), .Label = c("./.",
"0/0", "0/1", "0/2", "1/1", "1/2"), class = "factor"), X862 = structure(c(6L,
3L, 1L, 2L, 1L), .Label = c("./.", "0/0", "0/1", "0/2", "1/1",
"2/2"), class = "factor")), .Names = c("REF", "ALT", "X860",
"X861", "X862"), row.names = c(NA, -5L), class = "data.frame")
Expected output:
X860 NANA TC GG NANA NANA
X861 NANA CG GG NANA NANA
X862 GG TC NANA CC NANA
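To make the encoding concrete, here is a minimal sketch that decodes single cells for row 2 of the sample data (REF = "T", ALT = "C,G"). The helper name decode_genotype is hypothetical and only illustrates the mapping described above; it is not meant as a solution for the full data frame.

```r
# Row 2 of the sample data: REF = "T", ALT = "C,G"
ref <- "T"
alt <- "C,G"

# Position 1 holds the allele coded 0, position 2 the allele coded 1, and so on
alleles <- c(ref, strsplit(alt, ",", fixed = TRUE)[[1]])

# Hypothetical helper: decode one genotype string such as "0/1" or "./."
decode_genotype <- function(gt, alleles) {
  idx <- strsplit(gt, "/", fixed = TRUE)[[1]]
  # "." marks a missing call; anything else is a 0-based allele index
  out <- ifelse(idx == ".", "NA",
                alleles[suppressWarnings(as.integer(idx)) + 1L])
  paste(out, collapse = "")
}

decode_genotype("0/1", alleles)  # "TC"
decode_genotype("1/2", alleles)  # "CG"
decode_genotype("./.", alleles)  # "NANA"
```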
I got the following to work, but I'm quite unsure how well it performs on a data frame of that size:
letters <- strsplit(paste(mydf$REF,mydf$ALT,sep=","),",") # concatenate the letters to have an index to work on from the numbers
values <- t(mydf[,3:ncol(mydf)]) # let's work on each column needing values
nbval <- ncol(values) # Save time for later and save the length of values
#Prepare the two temp vectors used later
chars <- vector("character",2)
ret <- vector("character",nbval)
#Loop over the rows (and transpose the result)
t(sapply(rownames(values),
  function(x) {
    indexes <- strsplit(values[x,],"/") # Get a list with pairs of indexes
    for(i in 1:nbval) { # Loop over the number of columns :/
      for (j in 1:2) { # Loop over the pair
        chars[j] <- ifelse(indexes[[i]][j] == ".", "NA",letters[[i]][as.integer(indexes[[i]][j])+1]) # Get NA if . or the letter with the correct index at this position
      }
      ret[i] <- paste0(chars[1],chars[2]) # concatenate the two chars
    }
    return(ret) # return this for this row
  }
))
Output with sample data:
     [,1]   [,2] [,3]   [,4]   [,5]
X860 "NANA" "TC" "GG"   "NANA" "NANA"
X861 "NANA" "CG" "GG"   "NANA" "NANA"
X862 "GG"   "TC" "NANA" "CC"   "NANA"
Updated version of the function, following a comment (the rest of the code does not change). It returns a single "NA" instead of "NANA" when a cell is "./.":
#Loop over the rows (and transpose the result)
t(sapply(rownames(values),
  function(x) {
    indexes <- strsplit(values[x,],"/") # Get a list with pairs of indexes
    for(i in 1:nbval) { # Loop over the number of columns :/
      if (values[x,i] == "./.") { # test if we have ./. and if yes, set to NA
        ret[i] <- "NA"
      } else { # if it's not ./. then try to find the corresponding letters
        for (j in 1:2) { # Loop over the pair
          chars[j] <- ifelse(indexes[[i]][j] == ".", "NA",letters[[i]][as.integer(indexes[[i]][j])+1]) # Get NA if . or the letter with the correct index at this position
        }
        ret[i] <- paste0(chars[1],chars[2]) # concatenate the two chars
      }
    }
    return(ret) # return this for this row
  }
))
Output:
     [,1] [,2] [,3] [,4] [,5]
X860 "NA" "TC" "GG" "NA" "NA"
X861 "NA" "CG" "GG" "NA" "NA"
X862 "GG" "TC" "NA" "CC" "NA"
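Since performance is the open concern, one possible alternative is to avoid the per-cell loops and use matrix indexing instead. The following is only a sketch under a few assumptions: REF and ALT are the first two columns, every genotype has the form "a/b", and no allele index exceeds the largest number of alleles seen in any row. The names allele_mat, gt, decode and res are made up for this example, and it has not been benchmarked on the full 446664 x 234 data.

```r
# Stand-in for mydf with the same values as the dput in the question,
# stored as plain character columns
mydf <- data.frame(
  REF  = c("A", "T", "G", "C", "G"),
  ALT  = c("C,G", "C,G", "A", "G,T", "A"),
  X860 = c("./.", "0/1", "0/0", "./.", "./."),
  X861 = c("./.", "1/2", "0/0", "./.", "./."),
  X862 = c("2/2", "0/1", "./.", "0/0", "./."),
  stringsAsFactors = FALSE
)

# Allele lookup table: one row per variant, column k holds the allele coded k - 1
allele_list <- strsplit(paste(mydf$REF, mydf$ALT, sep = ","), ",", fixed = TRUE)
max_alleles <- max(lengths(allele_list))
allele_mat  <- t(vapply(allele_list,
                        function(a) a[seq_len(max_alleles)],  # pad short rows with NA
                        character(max_alleles)))

# Genotype matrix (variants x samples); factors would be converted to character here
gt <- as.matrix(mydf[, 3:ncol(mydf)])

# Text before and after the slash, for every cell at once
first  <- sub("/.*$", "", gt)
second <- sub("^.*/", "", gt)

# Look up the letters for a whole set of allele codes via (row, column) indexing
decode <- function(code) {
  idx <- suppressWarnings(as.integer(code))   # "." becomes NA
  out <- allele_mat[cbind(as.vector(row(gt)), idx + 1L)]
  out[is.na(idx)] <- "NA"                     # missing call, written as the string "NA"
  out
}

# Paste the two letters per cell, restore the shape, then transpose
res <- matrix(paste0(decode(first), decode(second)),
              nrow = nrow(gt), dimnames = dimnames(gt))
res <- t(res)
res
```

On the sample data this gives the same matrix as the first version above (with "NANA" for "./."). To get a single "NA" instead, as in the updated version, one could set the cells where gt == "./." to "NA" before transposing.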