DNA strings are written over an alphabet of four characters: A, C, G, and T.
Given a string,
ATGTTTAAA
I would like to split it into its constituent codons:
ATG TTT AAA
codons = ["ATG","TTT","AAA"]
Codons encode amino acids, and the code is redundant: several codons can map to the same amino acid (http://en.wikipedia.org/wiki/DNA_codon_table).
I have a DNA string in D and would like to split it into a range of codons and later translate/map the codons to amino acids.
std.algorithm has a splitter function, but it requires a delimiter, and the splitter function in std.regex requires a regex to split on. Is there an idiomatic approach to splitting a string into fixed-size pieces without a delimiter?
It looks like you are looking for chunks:
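import std.range : chunks;
import std.encoding : AsciiString;
import std.algorithm : map;

// Reinterpret a string literal as ASCII so that range functions skip UTF-8 decoding.
AsciiString ascii(string literal)
{
    return cast(AsciiString) literal;
}

void main()
{
    auto input = ascii("ATGTTTAAA");
    auto codons = input.chunks(3); // lazy range of 3-character slices
    auto aminoacids = codons.map!(
        (codon) {
            if (codon == ascii("ATG"))
                return "M";
            // ...
            return "?"; // placeholder so every path returns; replace with the real table
        }
    );
}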
Please note that I am using AsciiString (http://dlang.org/phobos/std_encoding.html#.AsciiString) here instead of plain string literals. This avoids the costly UTF-8 decoding that is done for string and is never needed for actual DNA sequences. I remember this making a notable performance difference in similar bioinformatics code before.
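For the translation step, a lookup in an associative array is one way to fill in the "// ..." part. The following is only a minimal sketch under a few assumptions: the function name translate and the '?' placeholder for unknown or incomplete codons are hypothetical, the table lists just a handful of codons from the standard codon table, and it uses std.string.representation (which views the string as immutable(ubyte)[]) rather than the AsciiString cast as another way to sidestep UTF-8 decoding.

```d
import std.range : chunks;
import std.string : representation;

// Hypothetical helper: translates a DNA string codon by codon.
// Codons missing from the table, and a trailing chunk shorter
// than three characters, are rendered as '?'.
string translate(string dna)
{
    // Partial codon table, for illustration only.
    // It is rebuilt on every call, which is fine for a sketch.
    auto table = [
        "ATG": 'M',
        "TTT": 'F',
        "TTC": 'F',
        "AAA": 'K',
        "AAG": 'K',
    ];

    string protein;
    // representation yields immutable(ubyte)[], so chunks slices
    // raw code units and no UTF-8 decoding takes place.
    foreach (codon; dna.representation.chunks(3))
    {
        // Reinterpret the slice as a string to use it as a key.
        auto p = (cast(string) codon) in table;
        protein ~= p ? *p : '?';
    }
    return protein;
}

void main()
{
    import std.stdio : writeln;

    writeln(translate("ATGTTTAAA")); // prints "MFK"
}
```

Note that chunks does not require the length to be a multiple of three: the last chunk is simply shorter, so a real implementation would have to decide how to treat an incomplete trailing codon.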