
How to generate one hot encoding for DNA sequences using R or python


I want to generate a one-hot encoding matrix for a list of DNA sequences. I tried the solutions from the question "How to generate one hot encoding for DNA sequences?", but some of them work only for a single DNA sequence, not for a list of sequences.

For example

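import numpy as np

def one_hot_encode(seq):
    # Map each base to a column index: A=0, C=1, G=2, T=3
    mapping = dict(zip("ACGT", range(4)))
    seq2 = [mapping[i] for i in seq]
    # Index the rows of a 4x4 identity matrix: one one-hot row per base
    return np.eye(4)[seq2]

one_hot_encode("AACGT")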

With the code above, calling one_hot_encode("AACGT","GGTAC","CGTAC") fails because the function accepts only a single sequence. I also want the output as a matrix, with one row per sequence.

I am currently working in R, and my DNA sequences are stored in a single-column R data frame:

ACTTTA
TTGATG
CTTACG
GTACGT

Expected output

1   0   0   0   0   1   0   0   0   0   0   1   0   0   0   1   0   0   0   1   1   0   0   0
0   0   0   1   0   0   0   1   0   0   1   0   1   0   0   0   0   0   0   1   0   0   1   0
0   1   0   0   0   0   0   1   0   0   0   1   1   0   0   0   0   1   0   0   0   0   1   0
0   0   1   0   0   0   0   1   1   0   0   0   0   1   0   0   0   0   1   0   0   0   0   1

Is it possible to do this in R?


Solution

    library(stringr)
    
    dataIn <- c(
      "ACTTTA", 
      "TTGATG", 
      "CTTACG", 
      "GTACGT"
      )
    
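    one_hot_encode <- function(baseSeq) {
      # Replace each base with its 4-digit one-hot code
      outSeq <- stringr::str_replace_all(baseSeq, c("A" = "1000",
                                                    "C" = "0100",
                                                    "G" = "0010",
                                                    "T" = "0001"))
      # Split the code string into single characters ("1" or "0")
      outSeq <- stringr::str_extract_all(outSeq, stringr::boundary("character"))
      # Convert to integers so the columns are numeric rather than character
      as.integer(unlist(outSeq))
    }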
    
    data.frame(do.call(rbind,lapply(dataIn, one_hot_encode)))
    

    gives

     > data.frame(do.call(rbind,lapply(dataIn, one_hot_encode)))
      X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17 X18 X19 X20 X21 X22 X23 X24
    1  1  0  0  0  0  1  0  0  0   0   0   1   0   0   0   1   0   0   0   1   1   0   0   0
    2  0  0  0  1  0  0  0  1  0   0   1   0   1   0   0   0   0   0   0   1   0   0   1   0
    3  0  1  0  0  0  0  0  1  0   0   0   1   1   0   0   0   0   1   0   0   0   0   1   0
    4  0  0  1  0  0  0  0  1  1   0   0   0   0   1   0   0   0   0   1   0   0   0   0   1
    

    Some row and column names might tidy up the output, but I think this is essentially what you were after?
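
Since the question also mentions Python, here is a rough sketch of the same idea without any third-party packages. The names `BASES` and `one_hot_matrix` are my own, and the checks for unexpected bases and unequal lengths are assumptions about what you would want, not something taken from the question. Each sequence becomes one flat row with four columns per base, in the order A, C, G, T.

```python
BASES = "ACGT"  # column order inside each 4-wide block: A, C, G, T

def one_hot_encode(seq):
    """Return a flat list of 0/1 values, four per base."""
    row = []
    for base in seq.upper():
        if base not in BASES:
            # Assumption: anything other than A, C, G, T (e.g. N) is an error
            raise ValueError(f"unexpected base {base!r} in {seq!r}")
        row.extend(1 if base == b else 0 for b in BASES)
    return row

def one_hot_matrix(seqs):
    """Encode equal-length sequences, one row per sequence."""
    seqs = list(seqs)
    if len({len(s) for s in seqs}) > 1:
        # Assumption: rows must line up, so mixed lengths are rejected
        raise ValueError("all sequences must have the same length")
    return [one_hot_encode(s) for s in seqs]

sequences = ["ACTTTA", "TTGATG", "CTTACG", "GTACGT"]
matrix = one_hot_matrix(sequences)
for row in matrix:
    print(*row)
```

The result is a plain list of lists; if you need a NumPy array, wrapping it in `np.array(matrix)` should work.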