
Split a DNA sequence into a list of codons with D


DNA strings are written in an alphabet of four characters: A, C, G, and T. Given a string,

ATGTTTAAA

I would like to split it into its constituent codons:

ATG TTT AAA

   codons = ["ATG","TTT","AAA"]

Codons encode the amino acids that make up proteins, and the code is redundant: several codons can map to the same amino acid (http://en.wikipedia.org/wiki/DNA_codon_table).

I have a DNA string in D and would like to split it into a range of codons and later translate/map the codons to amino acids.

std.algorithm has a splitter function, but it requires a delimiter, and the std.regex splitter likewise requires a regex to split on. Is there an idiomatic way to split a string into fixed-size pieces without a delimiter?


Solution

  • It looks like you want std.range.chunks:

    import std.range : chunks;
    import std.encoding : AsciiString;
    import std.algorithm : map;
    
    AsciiString ascii(string literal)
    {
        return cast(AsciiString) literal;
    }
    
    void main()
    {
        auto input = ascii("ATGTTTAAA");
        auto codons = input.chunks(3);
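        // map is lazy: a codon is translated only when the result is iterated
        auto aminoacids = codons.map!(
            (codon) {
                if (codon == ascii("ATG"))
                    return "M";
                // ...
                return "?"; // placeholder so that every path returns a value
            }
        );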
    }
    

    Note that I am using AsciiString (http://dlang.org/phobos/std_encoding.html#.AsciiString) here instead of plain strings. This avoids the costly UTF-8 decoding that range functions perform on string, which is never needed for an actual DNA sequence. I remember this making a notable performance difference in similar bioinformatics code.
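
For completeness, here is a minimal sketch of the full split-and-translate pipeline. It is not the code from the answer above: it swaps AsciiString for std.string.representation, which is another way to view the string as raw bytes so that chunks slices the array without decoding. The function names toCodons, translate, and toProtein are made up for illustration, the codon table is deliberately partial (a real one has 64 entries), and 'X' is an arbitrary marker for codons the sketch does not cover.

```d
import std.algorithm : map;
import std.array : appender, array;
import std.range : chunks;
import std.string : representation;

/// Split a DNA string into codons (hypothetical helper).
/// `representation` views the string as immutable(ubyte)[], so `chunks`
/// slices the array directly and no UTF-8 decoding takes place.
string[] toCodons(string dna)
{
    return dna.representation
              .chunks(3)
              .map!(codon => cast(string) codon) // reinterpret bytes as chars
              .array;
}

/// Translate one codon to a one-letter amino acid code (hypothetical helper).
/// Only a handful of codons are listed; everything else yields 'X'.
char translate(const(char)[] codon)
{
    switch (codon)
    {
        case "ATG":               return 'M'; // methionine (start)
        case "TTT", "TTC":        return 'F'; // phenylalanine
        case "AAA", "AAG":        return 'K'; // lysine
        case "TAA", "TAG", "TGA": return '*'; // stop
        default:                  return 'X'; // not covered by this sketch
    }
}

/// Translate a whole sequence, one codon at a time.
string toProtein(string dna)
{
    auto protein = appender!string();
    foreach (codon; toCodons(dna))
        protein.put(translate(codon));
    return protein.data;
}

void main()
{
    import std.stdio : writeln;

    writeln(toCodons("ATGTTTAAA"));  // ["ATG", "TTT", "AAA"]
    writeln(toProtein("ATGTTTAAA")); // MFK
}
```

If the sequence length is not a multiple of three, chunks yields a shorter final piece, which this sketch translates to 'X'. Real code would probably want to reject or trim such input explicitly.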