I have a fixed 32 bits in which to store as much DNA as possible. The amount of space required to store 1 character of DNA ('A', 'C', 'G' or 'T') is 2 bits (00, 01, 10, 11, as there are only 4 combinations).
To store up to 2 characters (so A, C, G, T, AA, AC, ..., GG) there are 20 possible combinations, which we can work out with the function ((4**(x+1))-4)/(4-1), where x is the maximum length of DNA we want to store. 16 characters of DNA would therefore have 5,726,623,060 combinations; however, in 32 bits I can only store 4,294,967,296 distinct values (2**32).
So long story short, in 32 bits the maximum amount of variable-length DNA one can store is 15 letters (1,431,655,764 combinations).
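As a quick sanity check of those counts (a sketch; `combos` is just a name of my choosing), the formula is the geometric series 4 + 4**2 + ... + 4**x, whose closed form is (4**(x+1) - 4)/3:

```python
def combos(x: int) -> int:
    """Number of DNA strings of length 1..x: sum of 4**i for i = 1..x."""
    return (4 ** (x + 1) - 4) // (4 - 1)

print(combos(2))   # 20            (A..T plus AA..GG)
print(combos(16))  # 5726623060    (too many for 32 bits)
print(combos(15))  # 1431655764    (fits)
print(2 ** 32)     # 4294967296
```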
So, the next step is to make a function which takes up to 15 letters of DNA as a string and turns it into a number. It doesn't matter which number ('A' could be 0, it could be 1, it could be 1332904; it really doesn't matter) so long as we can reverse the function and turn the number back into 'A' later.
I started to solve this by making a dictionary of key, value pairs containing 1,431,655,764 elements, but quickly ran out of RAM. This is why I need a translation function from string to int.
Here is my suggestion.
If storing your letters takes 2 to 30 bits, you have at least 2 bits left to help you deduce the length. Always add a 1 after the bits representing your characters, and pad the rest with zeroes. That way, if you look for the last 1 in your bit pattern, it will always be just after the end of your string.
E.g.
A -> 00 (meaning 'A') followed by 1 (meaning 'end') followed by 29 zeroes
-> 00100000000000000000000000000000
C -> 01 (meaning 'C') followed by 1 (meaning 'end') followed by 29 zeroes
-> 01100000000000000000000000000000
G -> 10 (meaning 'G') followed by 1 (meaning 'end') followed by 29 zeroes
-> 10100000000000000000000000000000
T -> 11 (meaning 'T') followed by 1 (meaning 'end') followed by 29 zeroes
-> 11100000000000000000000000000000
AA -> 0000 (meaning 'AA') followed by 1, followed by 27 zeroes
-> 00001000000000000000000000000000
...
AAAAAAAAAAAAAAA -> 000000000000000000000000000000 (meaning the 'A's) followed by 1, followed by 1 zero
-> 00000000000000000000000000000010
That ought to unambiguously represent your string and allow you up to 15 characters in 32 bits.
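A sketch of that scheme in Python (the names `encode` and `decode` are mine; it assumes exactly the bit layout described above, with the character bits at the top, then the 1 marker, then zero padding):

```python
def encode(dna: str) -> int:
    """Pack 1-15 DNA letters into a 32-bit integer with an end-marker bit."""
    assert 1 <= len(dna) <= 15
    code = {'A': 0b00, 'C': 0b01, 'G': 0b10, 'T': 0b11}
    bits = 0
    for ch in dna:
        bits = (bits << 2) | code[ch]          # append the 2-bit letter code
    bits = (bits << 1) | 1                     # append the 1 'end' marker
    return bits << (32 - (2 * len(dna) + 1))   # pad with zeroes to 32 bits

def decode(value: int) -> str:
    """Invert encode(): find the marker, then read the 2-bit codes above it."""
    marker = (value & -value).bit_length() - 1  # position of the lowest set bit
    n = (31 - marker) // 2                      # number of letters stored
    value >>= marker + 1                        # drop the padding and the marker
    out = []
    for _ in range(n):
        out.append('ACGT'[value & 0b11])        # last letter sits in the low bits
        value >>= 2
    return ''.join(reversed(out))

print(f'{encode("A"):032b}')   # 00100000000000000000000000000000
print(f'{encode("AA"):032b}')  # 00001000000000000000000000000000
print(decode(encode("ACGT")))  # ACGT
```

Note that the marker bit is essential: without it, a string of leading 'A's (all-zero codes) would be indistinguishable from a shorter string, since decode could not tell padding zeroes from 'A' codes.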