
How to convert genotyping data


I have a data frame called mydf (approximately 446664 x 234; a dput of a small sample is provided below). It has columns REF and ALT, followed by one column per sample.

REF holds a single letter in every row, while ALT can hold one, two, or three letters separated by commas (","). The remaining columns (the sample columns) are the ones I need to transform.

Treating the letter in REF as 0 and the first, second, and third letters in ALT as 1, 2, and 3 respectively, I need a function that does the following:

  1. Replaces the numbers in all sample columns (i.e. every column except REF and ALT) with the corresponding letters, filling any "./." with NA/NA.

  2. Collapses the "/" so that every cell holds a pair of letters.

  3. Transposes the sample columns, so that each sample becomes a row, as shown in the expected output. Thank you!

    mydf<-
    structure(list(REF = structure(c(1L, 4L, 3L, 2L, 3L), .Label = c("A", 
    "C", "G", "T"), class = "factor"), ALT = structure(c(6L, 6L, 
    1L, 9L, 1L), .Label = c("A", "A,C", "A,G", "A,T", "C", "C,G", 
    "C,T", "G", "G,T", "T"), class = "factor"), X860 = structure(c(1L, 
    3L, 2L, 1L, 1L), .Label = c("./.", "0/0", "0/1", "0/2", "1/1"
    ), class = "factor"), X861 = structure(c(1L, 6L, 2L, 1L, 1L), .Label = c("./.", 
    "0/0", "0/1", "0/2", "1/1", "1/2"), class = "factor"), X862 = structure(c(6L, 
    3L, 1L, 2L, 1L), .Label = c("./.", "0/0", "0/1", "0/2", "1/1", 
    "2/2"), class = "factor")), .Names = c("REF", "ALT", "X860", 
    "X861", "X862"), row.names = c(NA, -5L), class = "data.frame")
    

Expected output:

    X860 NANA TC GG   NANA NANA
    X861 NANA CG GG   NANA NANA
    X862 GG   TC NANA CC   NANA
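
The coding described above can be sketched for a single cell as follows. This is only an illustration of the mapping, not a solution for the full data frame: `decode_gt` is a hypothetical helper name, and it assumes a genotype is always two codes separated by "/".

```r
# Hypothetical helper: decode one genotype string for one variant.
decode_gt <- function(gt, ref, alt) {
  # Code 0 is REF; codes 1, 2, 3 are the comma-separated ALT letters
  alleles <- c(as.character(ref),
               strsplit(as.character(alt), ",", fixed = TRUE)[[1]])
  codes <- strsplit(as.character(gt), "/", fixed = TRUE)[[1]]
  decoded <- vapply(codes, function(code) {
    # "." marks a missing call; otherwise shift the 0-based code to R's 1-based index
    if (code == ".") "NA" else alleles[as.integer(code) + 1L]
  }, character(1), USE.NAMES = FALSE)
  paste(decoded, collapse = "")
}

decode_gt("0/1", ref = "T", alt = "C,G")  # "TC"   (second row, X860)
decode_gt("2/2", ref = "A", alt = "C,G")  # "GG"   (first row, X862)
decode_gt("./.", ref = "A", alt = "C,G")  # "NANA" (first row, X860)
```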

Solution

  • I came up with this, but I'm unsure how well it performs on the full data set:

    # Per-row allele vectors: REF first, then the ALT letters, so code k maps to element k + 1.
    # Note: this name shadows base R's built-in `letters` constant within this script.
    letters <- strsplit(paste(mydf$REF, mydf$ALT, sep = ","), ",")
    values <- t(mydf[, 3:ncol(mydf)]) # Transpose the sample columns: one row per sample
    nbval <- ncol(values) # Number of variants, computed once for the loops below
    
    #Prepare the two temp vectors used later
    chars <- vector("character",2) 
    ret <- vector("character",nbval)
    
    #Loop over the rows (and transpose the result)
    t(sapply(rownames(values),
       function(x) { 
         indexes <- strsplit(values[x,],"/") # Get a list with pairs of indexes
    
         for(i in 1:nbval) { # Loop over the number of columns :/
           for (j in 1:2) { # Loop over the pair 
             # Use "NA" for "."; otherwise take the letter at the coded index (0-based, hence + 1)
             chars[j] <- ifelse(indexes[[i]][j] == ".", "NA", letters[[i]][as.integer(indexes[[i]][j]) + 1])
           }
           ret[i] <- paste0(chars[1],chars[2]) # concatenate the two chars
         }
         return(ret) # return this for this row
       }
    ))
    

    Output with sample data:

         [,1]   [,2] [,3]   [,4]   [,5]  
    X860 "NANA" "TC" "GG"   "NANA" "NANA"
    X861 "NANA" "CG" "GG"   "NANA" "NANA"
    X862 "GG"   "TC" "NANA" "CC"   "NANA"
    

    Updated version of the function, following a suggestion from the comments (the setup code above is unchanged). Cells containing "./." now become a single "NA" instead of "NANA":

    #Loop over the rows (and transpose the result)
    t(sapply(rownames(values),
       function(x) {
         indexes <- strsplit(values[x,],"/") # Get a list with pairs of indexes
         for(i in 1:nbval) { # Loop over the number of columns :/
           if (values[x,i] == "./.") { # test if we have ./. and if yes, set to NA
             ret[i] <- "NA"
           } else { # if it's not ./. then try to find the corresponding letters
             for (j in 1:2) { # Loop over the pair 
               # Use "NA" for "."; otherwise take the letter at the coded index (0-based, hence + 1)
               chars[j] <- ifelse(indexes[[i]][j] == ".", "NA", letters[[i]][as.integer(indexes[[i]][j]) + 1])
             }
             ret[i] <- paste0(chars[1],chars[2]) # concatenate the two chars
           }
         }
         return(ret) # return this for this row
       }
    )) 
    

    Output:

         [,1] [,2] [,3] [,4] [,5]
    X860 "NA" "TC" "GG" "NA" "NA"
    X861 "NA" "CG" "GG" "NA" "NA"
    X862 "GG" "TC" "NA" "CC" "NA"
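
Since the concern above is performance on roughly 446664 rows, a vectorized variant may be worth trying. The following is a minimal sketch, not a benchmarked replacement: it assumes unphased genotypes separated by "/", and the names `allele_mat`, `geno`, `variant`, and `decode` are introduced here for illustration only. It rebuilds the sample data as plain character columns so that it runs on its own, and it produces the "NANA" form of the first output.

```r
# Sample data from the question, rebuilt as character columns (not factors)
mydf <- data.frame(
  REF  = c("A", "T", "G", "C", "G"),
  ALT  = c("C,G", "C,G", "A", "G,T", "A"),
  X860 = c("./.", "0/1", "0/0", "./.", "./."),
  X861 = c("./.", "1/2", "0/0", "./.", "./."),
  X862 = c("2/2", "0/1", "./.", "0/0", "./."),
  stringsAsFactors = FALSE
)

# Allele lookup matrix: one row per variant, column k + 1 holds the allele coded k
alleles <- strsplit(paste(mydf$REF, mydf$ALT, sep = ","), ",", fixed = TRUE)
n_max <- max(lengths(alleles))
allele_mat <- t(vapply(alleles, function(a) a[seq_len(n_max)], character(n_max)))

# Genotype strings as a character matrix (variants x samples), then flattened
geno <- as.matrix(mydf[, -(1:2), drop = FALSE])
gt <- as.vector(geno)                                    # column-major order
variant <- rep(seq_len(nrow(geno)), times = ncol(geno))  # variant row of each cell

# Decode a vector of allele codes ("0", "1", ..., or ".") in one indexing step
decode <- function(code) {
  idx <- rep(NA_integer_, length(code))
  known <- code != "."
  idx[known] <- as.integer(code[known])
  out <- allele_mat[cbind(variant, idx + 1L)]  # rows with an NA index yield NA
  out[is.na(out)] <- "NA"
  out
}

first  <- decode(sub("/.*", "", gt))  # code before the slash
second <- decode(sub(".*/", "", gt))  # code after the slash

# Reassemble as variants x samples, then transpose to samples x variants
result <- t(matrix(paste0(first, second), nrow = nrow(geno),
                   dimnames = dimnames(geno)))
result
```

The design choice here is to replace the two nested loops with a single matrix-index lookup per allele position, so the work is done over all cells at once. A code that exceeds the number of alleles listed for its variant would raise a "subscript out of bounds" error or silently yield "NA", so the input should be validated first if that can occur.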