Tags: algorithm, language-agnostic, compression, lossless-compression, continued-fractions

Good compression scheme for continued fraction terms?


So I'm implementing a continued fraction library for handling a subset of quadratic integers and rational numbers. Continued fraction terms are represented by unsigned integers. I've noticed the following general patterns when working with continued fractions:

  1. Most terms are small single digit numbers, with 1 being the most common.
  2. Some terms can be enormous, the largest possible for my application is 366 bits, but these are extremely rare.
  3. A large term represents an especially good rational approximation, which means a fraction containing a large term usually has fewer terms overall.
  4. The worst possible continued fraction is the golden ratio, and a rational approximation of that with 366 bits of precision corresponds to roughly 525 1's in a row.
  5. Random rational numbers do not generally have large runs of the same number, but may have two to four in a row, with 1 again the most common.
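For concreteness, the terms come from running the Euclidean algorithm on the numerator and denominator; here is a minimal Python sketch (the helper name is mine):

    from fractions import Fraction

    def cf_terms(x: Fraction) -> list[int]:
        """Continued fraction terms of a non-negative rational, via the
        Euclidean algorithm: split off the integer part, then continue
        with the reciprocal of the fractional part."""
        p, q = x.numerator, x.denominator
        terms = []
        while q:
            a, r = divmod(p, q)  # a is the next term, r the remainder
            terms.append(a)
            p, q = q, r
        return terms

    print(cf_terms(Fraction(355, 113)))  # [3, 7, 16]: one large term, few terms
    print(cf_terms(Fraction(8, 5)))      # [1, 1, 1, 2]: Fibonacci ratio, all small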

So I have a limit on both the number of terms and the size of a term, with the number of terms roughly inversely proportional to their size. Storing these terms in machine words or even bytes is therefore usually very wasteful of space, and I still need to handle multi-word arithmetic in the worst case. Given the roughly inverse relationship between term size and number of terms (both of which also depend on the size of the fraction's numerator and denominator), I've been trying to find or come up with a good compression scheme so that I don't waste so much space storing the integer terms.

I've considered Huffman encoding, as the speed of encoding and decoding is nice, but I'm not sure how to come up with the probabilities for the code values. I have a vague intuition that Stern-Brocot Trees might offer a hint, as they are binary trees with a direct relationship to continued fractions.

Does anyone know of a good compression approach for handling lots of small numbers with occasional huge ones, where runs of the same number are typically short (but might be long in rare cases)? In particular, I need to be able to encode and decode fairly fast (say, O(n*lg(n)) is the absolute worst speed I can tolerate, with O(n) or better preferable), and to be able to get at the position of individual terms so that I know which term number I'm operating on (fourth term, fifth term, etc.).

Also, I'm not interested in using any third party real number or continued fraction libraries. I've looked at several and they are either insufficient or overkill for my needs, and I'd like the experience of coding this up myself for my own edification.

Update

I've also learned of the Gauss–Kuzmin distribution, which gives the probability that a particular continued fraction term equals k for a random number uniformly distributed between 0 and 1. In Python pseudo-code, where lg is the base-2 logarithm, it is

    P(k) = -lg(1.0 - 1.0/((k + 1.0)**2))

This is the limiting distribution as the term index goes to infinity, so my case is somewhat more restricted (though 2**366 is still huge). "The Entropy of Continued Fractions" by Linas Vepstas gives the (information) entropy of the Gauss-Kuzmin distribution as approximately 3.43 bits. My maximum denominator is large enough that the information entropy of my continued fractions is probably close to that limit, though the graph in the linked article shows the limit is approached quite slowly, so it differs for relatively small denominators.
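As a sanity check, here is a short Python sketch (mine, not from the Vepstas article) that evaluates the Gauss-Kuzmin probabilities and a truncated estimate of their entropy:

    import math

    def gauss_kuzmin(k: int) -> float:
        """Probability that a continued fraction term equals k."""
        return -math.log2(1.0 - 1.0 / (k + 1.0) ** 2)

    # Small terms dominate: P(1) is about 0.415, P(2) about 0.17, ...
    for k in range(1, 6):
        print(k, round(gauss_kuzmin(k), 4))

    # Entropy truncated at a large cutoff; the sum converges (slowly)
    # toward the roughly 3.43 bits cited above.
    H = -sum(p * math.log2(p)
             for p in (gauss_kuzmin(k) for k in range(1, 1_000_000)))
    print(round(H, 2))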


Solution

  • One possibility is a simple prefix code: write n in binary as 1x (a leading 1-bit followed by the remaining bits x), encode each bit of x as 0 -> 10 and 1 -> 11, and append the terminator 0.

    Table:

    1 -> 0
    2 -> 100
    3 -> 110
    4 -> 10100
    5 -> 10110
    6 -> 11100
    7 -> 11110
    8 -> 1010100
    

    The codeword for n is 2*floor(lg(n)) + 1 bits long, so the corresponding Huffman code probabilities here for n are something like Theta(1/n^2) (waving my hands a bit because the alphabet is infinite), which matches the roughly 1/k^2 tail of the Gauss-Kuzmin distribution above.
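
    A minimal sketch of this scheme in Python (bit strings rather than a packed bitstream, for clarity; the function names are illustrative):

        def encode_term(n: int) -> str:
            """Write n >= 1 in binary as 1x, emit each bit of x as
            0 -> 10 and 1 -> 11, then append the terminator 0."""
            x = bin(n)[3:]  # strip '0b' and the leading 1-bit
            return ''.join('1' + b for b in x) + '0'

        def decode_terms(bits: str) -> list[int]:
            """Single left-to-right pass over concatenated codewords."""
            terms, n, i = [], 1, 0
            while i < len(bits):
                if bits[i] == '0':   # terminator: the current term is complete
                    terms.append(n)
                    n, i = 1, i + 1
                else:                # '1' followed by one payload bit of x
                    n = (n << 1) | int(bits[i + 1])
                    i += 2
            return terms

        stream = ''.join(encode_term(t) for t in [3, 7, 16])
        print(stream)                # 11011110101010100
        print(decode_terms(stream))  # [3, 7, 16]

    Since every codeword ends with the terminator 0, decoding is a single pass, and term positions fall out for free: the index of a term is just the count of terminators seen so far.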