r, dna-sequence, ape-phylo

How to use as.DNAbin{ape} with DNA sequences stored in a dataframe?


I have a dataframe with locus names in one column and DNA sequences in the other. I'm trying to use as.DNAbin{ape} or something similar to create a DNAbin object.

Here is some example data:

x <- structure(c("55548", "43297", "35309", "34468", "AATTCAATGCTCGGGAAGCAAGGAAAGCTGGGGACCAACTTCTCTTGGAGACATGAGCTTAGTGCAGTTAGATCGGAAGAGCA", "AATTCCTAAAACACCAATCAAGTTGGTGTTGCTAATTTCAACACCAACTTGTTGATCTTCACGTTCACAACCGTCTTCACGTT", "AATTCACCACCACCACTAGCATACCATCCACCTCCATCACCACCACCGGTTAAGATCGGAAGAGCACACTCTGAACTCCAGTC", "AATTCTATTGGTCATCACAATGGTGGTCCGTGGCTCACGTGCGTTCCTTGTGCAGGTCAACAGGTCAAGTTAAGATCGGAAGA"), .Dim = c(4L, 2L))

If I try y <- as.DNAbin(x), R creates a sort of DNAbin object with 4 DNA sequences (the 4 rows of the example) of length 2 (the two columns, I assume). There are no labels, and of course the base composition doesn't work either.

The documentation is not very clear, but after playing with the woodmouse example data from the package, I think what I need to do is create a matrix with each base in its own column and then use as.DNAbin, i.e. in the above example a 4 x 84 matrix (1 column for the locus name and 83 for the sequences?). Any advice on how to do this? Or any better ideas?

Thanks


Solution

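  • The first argument of as.DNAbin should be a matrix or a list containing the DNA sequences, or an object of class "alignment". So your idea is right, with one adjustment: the locus names go in the row names rather than in a column of their own, which gives a 4 x 83 matrix.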

    Given that x is the structure from the original post, the code below prepares the matrix y:

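    # Split each sequence into single bases (one per column) and lowercase them
    y <- t(sapply(strsplit(x[,2],""), tolower))
    # Use the locus names as row labels
    rownames(y) <- x[,1]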
    

    Then as.DNAbin(y) shows:

    4 DNA sequences in binary format stored in a matrix.
    
    All sequences of same length: 83 
    
    Labels: 55548 43297 35309 34468 
    
    Base composition:
        a     c     g     t 
    0.289 0.262 0.205 0.244
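
The matrix route above relies on every sequence having the same length. The answer also mentions that as.DNAbin accepts a list, which is the natural input when lengths differ. Below is a minimal sketch of that variant, not part of the original answer. The toy data and the names bases, bases_per_locus and dna_input are made up for illustration, and the final conversion is only attempted if ape is installed.

```r
# Toy data in the same two-column layout as the question (made-up values).
x <- cbind(c("locA", "locB", "locC"),
           c("AATTCA", "AATTCCTA", "AATT"))

# Split each sequence into single lowercase bases; the names carry the locus labels.
bases <- strsplit(tolower(x[, 2]), "")
names(bases) <- x[, 1]

# Number of bases in each sequence.
bases_per_locus <- lengths(bases)
print(bases_per_locus)

if (length(unique(bases_per_locus)) == 1L) {
  # Equal lengths: character matrix, one row per locus, one column per site.
  dna_input <- do.call(rbind, bases)
} else {
  # Unequal lengths: keep the named list of character vectors.
  dna_input <- bases
}

# Convert only if ape is available, so the preparation step runs on its own.
if (requireNamespace("ape", quietly = TRUE)) {
  dna <- ape::as.DNAbin(dna_input)
  print(dna)
}
```

With the toy data the lengths differ, so dna_input stays a named list. With the data from the question all four sequences have 83 bases, so the same code would take the matrix branch.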