Search code examples
rfastadna-sequence

Tajimas D for sequences of different length


I am trying to use adegent in R for calculating the Tajima's D. I have a DNA.bin object containing my sequences from different populations. The sequences are of varying length because of the SNP and indels in them. I get the following error when I run the tajima.test function:

Error in as.matrix.DNAbin(x) : DNA sequences in list not of the same length.

How could one deal with calculating Tajima's D for sequences of different length?

Here's what I did so far:

   x <- structure(c("55548", "43297", "35309", "34468", "AATTCAATGCTCGGGAAGCAAGGAAAGCTGGGGACCAACTTCTCTTGGAGACATGAGCTTAGTGCAGTTAGATCGGAAGAGCA", "AATTCCTAAAACACCAATCAAGTTGGTGTTGCTAATTTCAACACCAACTTGTTGATCTTCACGTTCACAACCGTCTTCACGTT", "AATTCACCACCACCACTAGCATACCATCCACCTCCATCACCACCACCGGTTAAGATCGGAAGAGCACACTCTGAACTGTAAACCCAGTC", "AATTCTATTGGTCATCACAATGGTGGTCCGTGGCTCACGTGCGTTCCTTGTGCAGGTCAACAGGTCAAGTTAAGATCGGAAGA"), .Dim = c(4L, 2L))
   y <- t(sapply(strsplit(x[,2],""), tolower))
   my.dnabin <- as.DNAbin(y)
   tajima.test(my.dnabin)

Solution

  • Tajima's D uses pairwise comparisons of nucleotides, and thus requires that your DNA is properly aligned before running. Yours is not at present, so you get an error. Most DNA.bin methods assume your DNA is aligned - this should be your first step.

    To align your DNA, try the clustal command from ape - you will need clustal installed. I've used clustal omega as I don't think R is the best way to do DNA alignments.

    You will get an aligned DNA sequence which is now suitable for Tajima's D (note the gaps, used to align the sequences):

    x <- structure(c("55548", "43297", "35309", "34468", 
                     "AATTCAATGCTCGGGAAGCAAGGAAAGCT---GGGGACCAACTTCTCTTGGAGACATGAGCTTAGTGCAGTTAGATCGGAAGAGCA-----------------------",
                     "AATTCCTAAAACACCAATCAAGT----TG---------GTGTTGCTAATTTCAACACCAACTTGTTGAT------------CTTCACGTTCACAACCGTCTTCACGTT-",
                     "-----AATTCACCA-------------CCACCACTAGCATACCATCCACCT--CCATCACCACCACCGGTTAAGATCGGAAGAGCACACTCTGAACTGTAAACCCAGTC",
                     "AATTCTATTGGTCATCACAATGGTGGTCCGTGGCTCACGTGCGTTCCTTGTGCAGGTCAACAGGTCAAGTTAAGATCGGAAGA--------------------------"), .Dim = c(4L, 2L))
    

    and now your test will run:

       y <- t(sapply(strsplit(x[,2],""), tolower))
       my.dnabin <- as.DNAbin(y)
       tajima.test(my.dnabin)
    $D
    [1] -5.624054
    
    $Pval.normal
    [1] 1.865271e-08
    
    $Pval.beta
    [1] 0