Tags: python, bioinformatics, dna-sequence

Function to store strings as ints


I have a fixed 32 bits in which to store as much DNA as possible. The amount of space required to store 1 character of DNA ('A', 'C', 'G' or 'T') is 2 bits (00, 01, 10, 11, as there are only 4 combinations).

To store up to 2 characters (so: A, C, G, T, AA, AC, ..., GG) there are 20 possible combinations, which we can work out with the formula ((4**(x+1))-4)/(4-1), where x is the maximum length of DNA we want to store. 16 characters of DNA would therefore have 5,726,623,060 combinations; however, in 32 bits I can only store 4,294,967,296 distinct values (2**32).

So long story short, in 32 bits the maximum amount of variable-length DNA one can store is 15 letters (1,431,655,764 combinations).
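
As a sanity check on those numbers, here is the closed form in Python (the name count_encodings is just illustrative):

    # DNA strings of length 1..x: 4 + 16 + ... + 4**x,
    # i.e. the geometric sum (4**(x + 1) - 4) / (4 - 1).
    def count_encodings(x):
        return (4 ** (x + 1) - 4) // 3

    print(count_encodings(2))   # 20
    print(count_encodings(15))  # 1431655764, fits under 2**32 = 4294967296
    print(count_encodings(16))  # 5726623060, too many for 32 bits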

So, the next step is to make a function which takes up to 15 letters of DNA as a string and turns it into a number. It doesn't matter which number ('A' could be 0, it could be 1, it could be 1332904, it really doesn't matter) so long as we can reverse the function and turn the number back into 'A' later.

I started to solve this by building a dictionary of 1,431,655,764 key-value pairs, but quickly ran out of RAM. This is why I need a translation function from string to int.


Solution

  • Here is my suggestion.

    If storing your letters takes 2 to 30 bits, you have at least 2 bits left to help you to deduce the length. Always add a 1 after the bits representing your characters, and pad the rest with zeroes. That way, if you look for the last 1 in your bit pattern, it will always be just after the end of your string.

    E.g.

    A  -> 00 (meaning 'A') followed by 1 (meaning 'end') followed by 29 zeroes
       -> 00100000000000000000000000000000
    C  -> 01 (meaning 'C') followed by 1 (meaning 'end') followed by 29 zeroes
       -> 01100000000000000000000000000000
    G  -> 10 (meaning 'G') followed by 1 (meaning 'end') followed by 29 zeroes
       -> 10100000000000000000000000000000
    T  -> 11 (meaning 'T') followed by 1 (meaning 'end') followed by 29 zeroes
       -> 11100000000000000000000000000000
    AA -> 0000 (meaning 'AA') followed by 1, followed by 27 zeroes
       -> 00001000000000000000000000000000
    ...
    AAAAAAAAAAAAAAA
       -> 000000000000000000000000000000 (thirty zeroes, meaning the fifteen 'A's) followed by 1, followed by 1 zero
       -> 00000000000000000000000000000010
    

    That ought to unambiguously represent your string and allow you up to 15 characters in 32 bits.
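
    A minimal sketch of this scheme in Python (the names encode and decode are my own; the A=00, C=01, G=10, T=11 mapping is the one used in the examples above):

    BASE_TO_BITS = {'A': 0b00, 'C': 0b01, 'G': 0b10, 'T': 0b11}
    BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}

    def encode(dna):
        """Pack 1 to 15 DNA letters into a 32-bit int, terminator-bit style."""
        if not 1 <= len(dna) <= 15:
            raise ValueError("only 1 to 15 letters fit in 32 bits")
        value = 0
        for base in dna:                      # two bits per letter, left to right
            value = (value << 2) | BASE_TO_BITS[base]
        value = (value << 1) | 1              # the 'end' marker
        return value << (31 - 2 * len(dna))   # zero-pad out to 32 bits

    def decode(value):
        """Reverse encode(): find the marker, then read two bits per letter."""
        if value == 0:
            raise ValueError("0 is not a valid encoding")
        padding = 0
        while value & 1 == 0:                 # strip the zero padding
            value >>= 1
            padding += 1
        value >>= 1                           # drop the 'end' marker itself
        length = (31 - padding) // 2          # marker position reveals the length
        letters = []
        for _ in range(length):               # count letters explicitly: 'A' is 00,
            letters.append(BITS_TO_BASE[value & 0b11])   # so leading A's would
            value >>= 2                                  # otherwise be lost
        return ''.join(reversed(letters))

    For example, encode('A') is 536870912 (0b00100000000000000000000000000000, the first row above), and decode(encode('GATTACA')) round-trips back to 'GATTACA'.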