Search code examples
rdata-visualizationsequence-analysis

seqinr dotplot - change axis


I have to datasets: seq1 and seq2 (DNA sequences). I wanted to do a dataplot, comparing the two sequences and placing a dot where the two sequences match. I was able to accomplish this using seqinr's dotplot, but what I am unable to do is to list the sequences on the axis, so you can see which dots matching. Essentially, I want to replace the numbers with the sequence letters.

Is there anyway to do this? Maybe through ggplot2?

Dotplot

These are my sequences:

seq1 <- c("G","C","T","A","G","T","C","A","G","A","T","C","T","G","A","C","G","C","T","A")
seq2 <- c("G","A","T","G","G","T","C","A","C","A","T","C","T","G","C","C","G","C")

And this is how I generated this graph:

dotPlot(seq1, seq2, main = "Dot plot of 2 different sequences
\nwsize = 4, wstep = 1, nmatch = 3", wsize = 4, wstep = 1, nmatch = 3)

Solution

  • How about a modified version (aka "a quick-and-dirty hack") of the dotplot-function? First, copy and paste the following code into R:

    dotplot<-function (seq1, seq2, wsize = 1, wstep = 1, nmatch = 1, cols = c("white", 
        "black"), xlab = deparse(substitute(seq1)), ylab = deparse(substitute(seq2)), type=2,
        ...) {
    
        cat("This is a modification of the function dotPlot from package seqinr.\n")
    
        require(seqinr)
    
        if (nchar(seq1[1]) > 1) 
            stop("seq1 should be provided as a vector of single chars")
        if (nchar(seq2[1]) > 1) 
            stop("seq2 should be provided as a vector of single chars")
        if (wsize < 1) 
            stop("non allowed value for wsize")
        if (wstep < 1) 
            stop("non allowed value for wstep")
        if (nmatch < 1) 
            stop("non allowed value for nmatch")
        if (nmatch > wsize) 
            stop("nmatch > wsize is not allowed")
        mkwin <- function(seq, wsize, wstep) {
            sapply(seq(from = 1, to = length(seq) - wsize + 1, by = wstep), 
                function(i) c2s(seq[i:(i + wsize - 1)]))
        }
        wseq1 <- mkwin(seq1, wsize, wstep)
        wseq2 <- mkwin(seq2, wsize, wstep)
        if (nmatch == wsize) {
            xy <- outer(wseq1, wseq2, "==")
        }
        else {
            "%==%" <- function(x, y) colSums(sapply(x, s2c) == sapply(y, 
                s2c)) >= nmatch
            xy <- outer(wseq1, wseq2, "%==%")
        }
    
        if(type==1) {
           image(x = seq(from = 1, to = length(seq1), length = length(wseq1)), 
               y = seq(from = 1, to = length(seq2), length = length(wseq2)), 
               z = xy, col = col, xlab = xlab, ylab = ylab, axes=F, ...)
           box()
        }
    
        colnames(xy)<-wseq2
        rownames(xy)<-wseq1
    
        xy2<-matrix(nrow=length(seq1), ncol=length(seq2), data=FALSE)
        rownames(xy2)<-seq1
        colnames(xy2)<-seq2
        ind<-which(xy, arr.ind=T)
        xy2[ind]<-TRUE
        ind<-data.frame(ind, row.names=NULL)    
    
        res<-data.frame(row=c(), col=c())
        for(i in 1:nrow(ind)) {
           DF<-data.frame(row=seq(from=ind$row[i], to=ind$row[i]+wsize-1),
                          col=seq(from=ind$col[i], to=ind$col[i]+wsize-1))
           res<-rbind(res, DF)
        }
        xy2[as.matrix(res)]<-TRUE
    
        if(type==2) {
           image(x = seq(from = 1, to = length(seq1)), y = seq(from = 1, to = length(seq2)), z = xy2, col = cols, xlab = xlab, ylab = ylab, axes=F, ...)
           box()
           axis(side=1, at=1:length(seq1), labels=seq1)
           axis(side=2, at=1:length(seq2), labels=seq2, las=1)
        }
    
        out<-list(type1=xy, type2=xy2)
        return(out)
    }
    

    Then running the example you provided:

    seq1 <- c("G","C","T","A","G","T","C","A","G","A","T","C","T","G","A","C","G","C","T","A")
    seq2 <- c("G","A","T","G","G","T","C","A","C","A","T","C","T","G","C","C","G","C")
    xy<-dotplot(seq1, seq2, wsize = 4, wstep = 1, nmatch = 3)
    

    should produce the following plot:

    enter image description here

    In short, I expanded the function a little to produce a full matrix between the two sequences. If you inspect the output object xy, you'll see the original (type1) matrix, and the expanded one (type2). For very long sequences this modification is not efficient, not to mention that the nucleotide/amino acid labels on the axes will overlap each other. You can change the plot type between type1 and type2 using the new argument type.