Search code examples
calgorithmbase36

Curious about base36 encoding


I want to base36 encode 128bit hexadecimal number, but 128bit exceeds the range of the largest number supported by c language. So, i can't get the value of 36 numbers by finding the remainder and finding the quotient.

I'm curious about the internal algorithm in base36 that handles such long strings. I am wondering how to express the number that cannot be expressed in the number range of c language. Can you tell me about the algorithm for base36? Or, I would like a book or site for reference.


Solution

  • I am wondering how to express the number that cannot be expressed in the number range of c language.

    Consider an array of bytes. Those bytes consist of bits. Now, assume that the byte consists of 8 bits. Next, consider an array of 16 bytes. There are a total of 128 bits in that array. You can use the lowest element to represent the first 8 bits of a 128 bit integer, the next element to represent bits 8...15 and so on.

    That is how arbitrarily large integers can be represented in C: Using arrays of smaller integers each of the elements representing a high radix digit. In the scheme that I described, the number is represented using a radix of 256. You don't necessarily need to use an array of bytes. Typically, arbitrary precision math uses elements of the CPU word size for efficiency. In case of 32 bit element for example, that would be a radix of 4'294'967'296.

    In the base36 encoding, the radix i.e. the base of the represenation is - you may have guessed this - 36. It is a textual representation, so elements of the array are chars. Instead of using all 256 values, this representation only uses 36 of those values; specifically those values that encode the upper case latin letter characters and the arabic numeral characters.

    Can you tell me about the algorithm for base36?

    There are essentially two steps:

    • First convert the input data to radix-36.
    • Next map those digits to text so that the digit 0 maps to the character '0', the digit 10 maps to 'A' and 35 maps to 'Z'. You can iterpolate the mappings that I did not provide.