Tags: c, algorithm, bit-manipulation, bioinformatics, micro-optimization

Converting nucleobase representation from ASCII to UCSC .2bit


Unambiguous DNA sequences consist only of the nucleobases adenine (A), cytosine (C), guanine (G), and thymine (T). For human consumption, the bases may be represented by the corresponding char in either uppercase or lowercase: A, C, G, T or a, c, g, t. This representation is inefficient, however, when long sequences need to be stored. Since only four symbols occur, each symbol can be assigned a 2-bit code. The commonly used .2bit format specified by UCSC does exactly that, using the following encoding: T = 0b00, C = 0b01, A = 0b10, G = 0b11.
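
For example, under this encoding the four-base sequence TCAG packs into a single byte; per the UCSC spec, the first base occupies the most significant two bits of each byte. A one-line illustration (the variable name is chosen here purely for illustration):

unsigned char packed_TCAG = (0u << 6) | (1u << 4) | (2u << 2) | (3u << 0); /* T,C,A,G -> 0x1b */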

The switch-based C code below shows a reference implementation written for clarity. Open-source software that converts genomic sequences represented as char sequences typically uses a 256-entry lookup table indexed by each char in turn (a sketch of this approach follows the reference implementation). This also isolates the conversion from the internal representation of char. However, memory access is energetically expensive, even when it hits an on-chip cache, and generic table look-ups are difficult to SIMDize. It would therefore be advantageous if the conversion could be accomplished with simple integer arithmetic. Given that ASCII is the dominant char encoding, one can restrict attention to it.

What are efficient computational approaches to convert nucleobases given as ASCII characters to their .2bit-representation?

/* Convert nucleobases A, C, G, T represented as either uppercase or lowercase
   ASCII characters into the UCSC .2bit representation. Passing any value other
   than those corresponding to 'A', 'C', 'G', 'T', 'a', 'c', 'g', 't' results
   in an indeterminate response.
*/
unsigned int ascii_to_2bit (unsigned int a)
{
    const unsigned int UCSC_2BIT_A = 2;
    const unsigned int UCSC_2BIT_C = 1;
    const unsigned int UCSC_2BIT_G = 3;
    const unsigned int UCSC_2BIT_T = 0;

    switch (a) {
    case 'A':
    case 'a':
        return UCSC_2BIT_A;
    case 'C':
    case 'c':
        return UCSC_2BIT_C;
    case 'G':
    case 'g':
        return UCSC_2BIT_G;
    case 'T':
    case 't':
    default:
        return UCSC_2BIT_T;
    }
}
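
For comparison, the table-driven conversion mentioned above might look like the sketch below. The table entries follow from the encoding; the choice of 0 as the value for all other characters is an arbitrary assumption made here for illustration, and actual tools vary.

static const unsigned char ascii_to_2bit_lut[256] = {
    ['A'] = 2, ['a'] = 2,
    ['C'] = 1, ['c'] = 1,
    ['G'] = 3, ['g'] = 3,
    ['T'] = 0, ['t'] = 0,  /* redundant (array is zero-initialized), shown for clarity */
};

unsigned int ascii_to_2bit_table (unsigned int a)
{
    return ascii_to_2bit_lut[a & 0xff];
}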

Solution

  • One option is the following:

    unsigned int ascii_to_2bit (unsigned int a)
    {
        return ((0xad - a) & 6) >> 1;
    }
    

    This works because the low three bits of the ASCII codes already distinguish the four bases (the case bit, 0x20, only affects higher bit positions, and borrows propagate only upward): subtracting the character from 0xad places the required 2-bit code in bits 1 and 2 of the difference. The computation has the advantage that it needs only the low 8 bits, never borrows out of a byte for valid inputs, and contains no variable shifts, so it can be parallelized immediately even without dedicated SIMD instructions, e.g. with 8 characters packed into a 64-bit word

    /* requires <stdint.h> for uint64_t */
    uint64_t ascii_to_2bit_8bytes (uint64_t a)
    {
        return ((0xadadadadadadadad - a) & 0x0606060606060606) >> 1;
    }
    

    leaving the two output bits at the bottom of each byte.
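
    A quick sanity check, sketched below, compares the arithmetic conversion against the switch-based reference for all eight valid inputs and then exercises the 8-byte variant. The reference function is renamed ascii_to_2bit_ref here to avoid a name clash, and main() is scaffolding added purely for illustration.

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    static unsigned int ascii_to_2bit_ref (unsigned int a)
    {
        switch (a) {
        case 'A': case 'a': return 2;
        case 'C': case 'c': return 1;
        case 'G': case 'g': return 3;
        default:            return 0;  /* 'T', 't' */
        }
    }

    static unsigned int ascii_to_2bit (unsigned int a)
    {
        return ((0xad - a) & 6) >> 1;
    }

    static uint64_t ascii_to_2bit_8bytes (uint64_t a)
    {
        return ((0xadadadadadadadad - a) & 0x0606060606060606) >> 1;
    }

    int main (void)
    {
        const char bases[8] = "ACGTacgt";  /* exactly 8 chars, no terminator needed */
        unsigned char out[8];
        uint64_t w;

        /* scalar version must agree with the reference for all valid inputs */
        for (int i = 0; i < 8; i++) {
            unsigned int c = (unsigned char)bases[i];
            if (ascii_to_2bit (c) != ascii_to_2bit_ref (c)) {
                printf ("mismatch for '%c'\n", bases[i]);
                return 1;
            }
        }

        /* convert all 8 characters at once; each output byte holds one code */
        memcpy (&w, bases, 8);
        w = ascii_to_2bit_8bytes (w);
        memcpy (out, &w, 8);
        for (int i = 0; i < 8; i++) {
            printf ("%c -> %u\n", bases[i], (unsigned)out[i]);  /* A->2 C->1 G->3 T->0 */
        }
        return 0;
    }

    The memcpy round trip keeps the check independent of byte order: the subtraction, mask, and shift never move information across byte boundaries for valid inputs, so each output byte corresponds positionally to its input character.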