c# compression huffman-code lossless-compression

How to compress a DNA sequence (a four-letter alphabet)


I want to compress a DNA sequence using a technique other than Huffman or adaptive Huffman coding. I'm using C# as the programming language. Can anyone point me to a suitable algorithm? Note: the compression must be lossless.


Solution

  • Each position in a DNA sequence has four possible states, namely:

    • Guanine (G, 00)
    • Cytosine (C, 01)
    • Adenine (A, 10)
    • Thymine (T, 11)

    You can store each of these four states in two bits, using the values in brackets. With this simple method you can pack four bases into one byte, a quarter of the size of an ASCII text file that spends one byte per base.
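
The two-bit packing described above could be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the class name `DnaPacker` and its methods are hypothetical, the first base is assumed to go in the highest two bits of each byte, and the input is assumed to contain only G, C, A and T (no `N` or other ambiguity codes). The number of bases has to be stored separately, because the last byte may be only partly filled.

```csharp
using System;
using System.Text;

// Hypothetical helper: packs four bases per byte using G=00, C=01, A=10, T=11.
public static class DnaPacker
{
    // The index of each letter in this string is its two-bit code.
    private const string Bases = "GCAT";

    public static byte[] Pack(string sequence)
    {
        // Round up so a trailing group of 1-3 bases still gets a byte.
        var packed = new byte[(sequence.Length + 3) / 4];
        for (int i = 0; i < sequence.Length; i++)
        {
            // Lower-case input is accepted, but the case is not preserved.
            int code = Bases.IndexOf(char.ToUpperInvariant(sequence[i]));
            if (code < 0)
                throw new ArgumentException("Unsupported base '" + sequence[i] + "' at index " + i + ".");

            // Assumption: the first base of each group occupies the top two bits.
            int shift = 6 - 2 * (i % 4);
            packed[i / 4] |= (byte)(code << shift);
        }
        return packed;
    }

    public static string Unpack(byte[] packed, int baseCount)
    {
        var result = new StringBuilder(baseCount);
        for (int i = 0; i < baseCount; i++)
        {
            int shift = 6 - 2 * (i % 4);
            result.Append(Bases[(packed[i / 4] >> shift) & 0x3]);
        }
        return result.ToString();
    }
}
```

For example, `DnaPacker.Pack("GCATT")` produces two bytes, and `DnaPacker.Unpack(bytes, 5)` restores the original five bases.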


    Update
    As @kol mentioned, you can then apply practically any general-purpose compression algorithm to shrink the data further. Currently .NET ships with two compression methods (Deflate and GZip, exposed as DeflateStream and GZipStream in System.IO.Compression), and more can be found in the open-source SharpZipLib library.
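
As a rough sketch of that second stage, the built-in `GZipStream` can be run over any byte array, for instance the two-bit packed bases. The wrapper class `DnaGzip` and its method names are hypothetical. How much this saves depends on the data, and for short inputs the GZip header can make the output larger than the input.

```csharp
using System.IO;
using System.IO.Compression;

// Hypothetical wrapper around the GZipStream class from System.IO.Compression.
public static class DnaGzip
{
    public static byte[] Compress(byte[] data)
    {
        using (var output = new MemoryStream())
        {
            // Disposing the GZipStream flushes the final compressed block.
            using (var gzip = new GZipStream(output, CompressionMode.Compress))
            {
                gzip.Write(data, 0, data.Length);
            }
            // ToArray still works after the underlying stream has been closed.
            return output.ToArray();
        }
    }

    public static byte[] Decompress(byte[] compressed)
    {
        using (var input = new MemoryStream(compressed))
        using (var gzip = new GZipStream(input, CompressionMode.Decompress))
        using (var output = new MemoryStream())
        {
            gzip.CopyTo(output);
            return output.ToArray();
        }
    }
}
```

Because both steps are lossless, calling `Decompress` and then unpacking the two-bit codes recovers the original sequence exactly.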