c++data-structures compression lossless-compression

arithmetic encoder symbol lookup

I have working simplish arithmetic encoder based on Drdobbs ARcoder. The de/encoding algorithm boils down to maintaining symbol probabilities in array and updating them. The problem is that I would like to expand it to support growing number of symbols than fixed +256 or so symbols. The current implementation of update_model() and getsym() are too slow if number of symbols increase too much.

Can I convert the cumulative_freq array into tree like data-structure for faster lookup and updating? I don't mind if the algorithm ends up consuming more memory as long as it scales better.

const u32 MAX_INPUT_VAL = 255;
const u32 CODE_RESCALE = MAX_INPUT_VAL+1;
const u32 CODE_EOF = MAX_INPUT_VAL+2;
const u32 CUMULATIVE_TOTAL = MAX_INPUT_VAL+3;  // total sum of all frequencies before.
const u32 CODES_END = MAX_INPUT_VAL+4;

u32 cumulative_freq[CODES_END];

// (Init symbol freqs)
for(u32 i = 0; i < CODES_END; ++i) {
    cumulative_freq[i] = i;
}

// update value propability
void update_model(u32 v) {
    // how can I make this scale better for more symbols?
    for(code_t i = v + 1; i < CODES_END; ++i) {
        cumulative_freq[i]++;
    }
}

struct sym {
    u32 low;
    u32 high;
    u32 count;
};

// encode symbol: get symbol propability
sym propability(u32 v) {
    sym p{cumulative_freq[v], cumulative_freq[v+1], cumulative_freq[CUMULATIVE_TOTAL]};
    update_model(v);
    return p;
}

// decode symbol:
std::pair<sym, u32> getsym(u64 prop) {
    // how can I make this scale better for more symbols?
    for(u32 i = 0; i < CUMULATIVE_TOTAL; ++i) {
        if(prop < cumulative_freq[i+1]) {
            sym p{cumulative_freq[i],cumulative_freq[i+1],cumulative_freq[CUMULATIVE_TOTAL]};
            update_model(i);
            return {p,i};
        }
    }
    // error: decode failed.
}

Solution

Yes, you can arrange your counts into a tree. You have a leaf for every symbol that contains its probability (not cumulative), and then each internal node contains the sum of probabilities from the nodes beneath it.

To update the model, you need to add 1 to the symbol's leaf and each of its ancestors. To search, you can dynamically calculate the cumulative probability for the symbol as you descend into the tree -- every time you go right you add the count from the left child you didn't enter.

This is an order statistic tree, but using cumulative probabilities instead of rank: https://en.wikipedia.org/wiki/Order_statistic_tree

The structure of this tree never changes, however. There are no inserts or deletes. Because of that, you don't need to use actual node objects with pointers, etc. You can just use an array of counts, arranged like a binary heap array: https://www.geeksforgeeks.org/array-representation-of-binary-heap/

If you have N symbols, then there will be N-1 counts for the internal nodes, followed by N counts for the leaf symbols, and every node except the root will have a parent at floor(index-1)/2.

This structure is still quite memory efficient, all operations take O(log N) time, and the code is quite small.

Update looks like this:

unsigned i = symbol + num_symbols-1;
for (;;) {
    ++counts[i];
    if (!i)
        break;
    i = (i-1)/2;
}

Get cumulative P for symbol:

unsigned P=0;
for (unsigned i=symbol+num_symbols-1; i>0; i=(i-1)/2) {
    if (!(i&1)) {
        //this is a right child, add counts from the left
        P+=counts[i-1];
    }
}

Get symbol by cumulative probability:

unsigned P=0;
unsigned i=0;
while (i<(num_symbols-1)) {
    i=i*2+1; //left child
    if (targetP >= P+counts[i]) {
        //use right child instead
        P+=counts[i];
        ++i;
    }
}
unsigned symbol = i-(num_symbols-1);

Note that the smallest-indexed symbol (num_symbols-1) is NOT the left-most in the tree unless num_symbols is a power of 2, but that should be OK. As long as the encoder and decoder agree, the ordering of the symbols is irrelevant to your application.