I have a data frame with loci names in one column and DNA sequences in the other. I'm trying to use as.DNAbin from the ape package, or something similar, to create a DNAbin object.
Here is some example data:
x <- structure(c("55548", "43297", "35309", "34468", "AATTCAATGCTCGGGAAGCAAGGAAAGCTGGGGACCAACTTCTCTTGGAGACATGAGCTTAGTGCAGTTAGATCGGAAGAGCA", "AATTCCTAAAACACCAATCAAGTTGGTGTTGCTAATTTCAACACCAACTTGTTGATCTTCACGTTCACAACCGTCTTCACGTT", "AATTCACCACCACCACTAGCATACCATCCACCTCCATCACCACCACCGGTTAAGATCGGAAGAGCACACTCTGAACTCCAGTC", "AATTCTATTGGTCATCACAATGGTGGTCCGTGGCTCACGTGCGTTCCTTGTGCAGGTCAACAGGTCAAGTTAAGATCGGAAGA"), .Dim = c(4L, 2L))
If I try y <- as.DNAbin(x), R creates a sort of DNAbin object with 4 DNA sequences (the 4 rows of the example) of length 2 (the two columns, I assume). There are no labels, and of course the base composition doesn't work either.
The documentation is not very clear, but after playing with the woodmouse example data from the package, I think what I need to do is create a matrix with each base in its own column and then use as.DNAbin, i.e. in the above example a 4 x 84 matrix (1 column for the locus name and 83 for the sequences?). Any advice on how to do this? Or any better ideas?
Thanks
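The first argument of as.DNAbin should be a matrix or a list containing the DNA sequences, or an object of class "alignment". So your idea is right, with one adjustment: the locus names go in the row names of the matrix rather than in a column of their own, so the matrix is 4 x 83.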
Given that x is the structure from the original post, the code below prepares the matrix y:
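# Split each sequence into single lowercase bases; one row per sequence
y <- t(sapply(strsplit(x[,2],""), tolower))
# Use the locus names (first column of x) as row labels
rownames(y) <- x[,1]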
Then as.DNAbin(y) shows:
4 DNA sequences in binary format stored in a matrix.
All sequences of same length: 83
Labels: 55548 43297 35309 34468
Base composition:
a c g t
0.289 0.262 0.205 0.244
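The matrix approach above relies on all sequences having the same length. Since as.DNAbin also accepts a list, a variant for loci of unequal length could look like the sketch below. The loci and seqs vectors are made-up illustration data, not taken from the original post, and this is only one possible way to do it.

```r
library(ape)

# Hypothetical example data: two loci with sequences of different lengths
loci <- c("locusA", "locusB")
seqs <- c("AATTCA", "AATTCCTA")

# Split each sequence into single lowercase bases; the result is a list
# with one character vector per locus
seq_list <- strsplit(tolower(seqs), "")

# Name the list elements so that the DNAbin object carries the locus labels
names(seq_list) <- loci

# as.DNAbin() converts each list element, so unequal lengths are allowed
dna <- as.DNAbin(seq_list)
print(dna)
```

With equal-length sequences the matrix form is usually more convenient, because it is treated as an alignment; the list form simply stores each sequence separately.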