Search code examples
rperformancematrixgenome

Efficient code to map genotype matrix in R


Hi I want to convert a matrix of genotypes, encoded as triples to a matrix encoded as 0, 1, 2, i.e.

c(0,0,1) <-> 0; c(0,1,0) <-> 1; c(0,0,1) <-> 2

First here is some code to generate the matrix that needs to be reduced.

# generate genotypes
expand.G = function(n,p){
  probs = runif(n = p)
  G012.rows = matrix(rbinom(2,prob = probs,n=n*p),nrow = p)
  colnames(G012.rows) = paste('s',1:n,sep = '')
  rownames(G012.rows) = paste('g',1:p, sep = '')
  G012.cols = t(G012.rows)

  expand.geno = function(g){
    if(g == 0){return(c(1,0,0))}
    if(g == 1){return(c(0,1,0))}
    if(g == 2){return(c(0,0,1))}
  }

  gtype = c()
  for(i in 1:length(c(G012.cols))){
    gtype = c(
      gtype,
      expand.geno(c(G012.cols)[i])
    )
  }

  length(gtype)

  G = matrix(gtype,byrow = T, nrow = p)
  colnames(G) = paste('s',rep(1:n,each = 3),c('1','2','3'),sep = '')
  rownames(G) = paste('g',1:p, sep = '')
  print(G[1:10,1:15])
  print(G012.rows[1:10,1:5])

  return(G)
}

The output has 3n columns and p rows, where n is sample size and p is number of genotypes. Now we can reduce the matrix back to 0,1,2 coding with the following functions

reduce012 = function(x){
  if(identical(x, c(1,0,0))){
    return(0)
  } else if(identical(x, c(0,1,0))){
    return(1)
  } else if(identical(x,  c(0,0,1))){
    return(2)
  } else { 
    return(NA)
  }
}

reduce.G = function(G.gen){
  G.vec = 
    mapply(function(i,j) reduce012(as.numeric(G.gen[i,(3*j-2):(3*j)])), 
           i=expand.grid(1:(ncol(G.gen)/3),1:nrow(G.gen))[,2], 
           j=expand.grid(1:(ncol(G.gen)/3),1:nrow(G.gen))[,1]
    )

  G = matrix(G.vec, nrow = ncol(G.gen)/3, ncol = nrow(G.gen))
  colnames(G) = rownames(G.gen)
  return(G)
}

reduce.G.loop = function(G.gen){
  G = matrix(NA,nrow = ncol(G.gen)/3, ncol = nrow(G.gen))
  for(i in 1:nrow(G.gen)){
    for(j in 1:(ncol(G.gen)/3)){
      G[j,i] = reduce012(as.numeric(G.gen[i,(3*j-2):(3*j)]))
    }
  }
  colnames(G) = rownames(G.gen)
  return(G)
}

The output is n rows by p columns. It is incidental, but intentional, that the matrix encoded as 0,1,2 is the transpose of the matrix encoded as triples.

The code is not particularly fast. What is bothering me is that the the timing goes with n^2. Can you explain or supply more efficient code?

G = expand.G(1000,20)
system.time(reduce.G(G))
system.time(reduce.G.loop(G))

G = expand.G(2000,20)
system.time(reduce.G(G))
system.time(reduce.G.loop(G))

G = expand.G(4000,20)
system.time(reduce.G(G))
system.time(reduce.G.loop(G))

Solution

  • You can simply make an accessor lookup table:

    decode <- array(dim = c(3, 3, 3))
    decode[cbind(1, 0, 0) + 1] <- 0
    decode[cbind(0, 1, 0) + 1] <- 1
    decode[cbind(0, 0, 1) + 1] <- 2
    

    And then, just do:

    matrix(decode[matrix(t(G + 1), ncol = 3, byrow = TRUE)], ncol = nrow(G))
    

    This full vectorized R version will give you the same matrix, without dimnames and super fast.

    Yet, if you have much larger matrices, you should really use Rcpp for both memory and timing issues.