Tags: c++, random, entropy

Generate random number sequence with certain entropy


I need to generate a partly random sequence of numbers such that the sequence overall has a certain entropy level.

E.g. if I fed the generated data into gzip, it would be able to compress it. And in fact, this is the exact application for the code: testing data compressors.

I'm programming this in C++, and the first idea that came to mind was to initialize a bunch of std::mt19937 PRNGs with random seeds, choose one randomly, and generate a random-length pattern with it. The chosen std::mt19937 is reset each time with the same seed so that it always generates the same pattern:

#include <iostream>
#include <random>
#include <vector>

int main() {

    std::random_device rd;
    std::vector<std::mt19937> rngs;
    std::vector<int> seeds;

    std::uniform_int_distribution<int> patternrg(0,31);
    std::uniform_int_distribution<int> lengthrg(1,64);
    std::uniform_int_distribution<int> valuerg(0,255);

    for(int i = 0; i < 32; ++i) {
        seeds.push_back(rd());
        rngs.emplace_back(seeds.back());
    }

    for(;;) {
        // Choose a generator and a pattern length randomly.
        auto gen = patternrg(rd);
        auto len = lengthrg(rd);
        // Re-seed the chosen generator so it replays its fixed pattern.
        rngs[gen].seed(seeds[gen]);
        for(int i = 0; i < len; ++i) {
            std::cout << valuerg( rngs[gen] )<<"\n";
        }
    }
}

The above code meets the first requirement of generating compressible randomness, but the second is harder: how to control the level of entropy/randomness?


Solution

  • Let me write a few sentences which you may find useful. Suppose we want to sample one bit with a given entropy. So it is either 0 or 1, and the entropy you want is equal to e.

    H(1,0|p) = -p log2(p) - (1 - p) log2(1 - p), where p is the probability to get 1. Simple test - in the case of p=1/2 one gets an entropy of 1, the maximum entropy. So you pick e equal to some value below 1, solve the equation

    -p log2(p) - (1 - p) log2(1 - p) = e

    and get back p; then you can start sampling using a Bernoulli distribution. In C++ one could use the standard library's std::bernoulli_distribution.
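
    For illustration, a minimal sketch of that idea: solve the binary-entropy equation for p by bisection on [0, 0.5] (see the update below about the second root), then draw bits with std::bernoulli_distribution. The helpers binary_entropy and solve_p_for_entropy are just illustrative names:

    #include <cmath>
    #include <iostream>
    #include <random>

    // H(p) = -p log2(p) - (1-p) log2(1-p), with H(0) = H(1) = 0.
    double binary_entropy(double p) {
        if (p <= 0.0 || p >= 1.0) return 0.0;
        return -p * std::log2(p) - (1.0 - p) * std::log2(1.0 - p);
    }

    // Solve H(p) = e for p in [0, 0.5], where H is monotonically increasing.
    double solve_p_for_entropy(double e) {
        double lo = 0.0, hi = 0.5;
        for (int i = 0; i < 60; ++i) {
            double mid = 0.5 * (lo + hi);
            if (binary_entropy(mid) < e) lo = mid; else hi = mid;
        }
        return 0.5 * (lo + hi);
    }

    int main() {
        double e = 0.5;                    // target entropy per bit
        double p = solve_p_for_entropy(e); // probability of a 1-bit

        std::mt19937 gen{std::random_device{}()};
        std::bernoulli_distribution bit(p);

        // Emit some bits whose per-bit entropy is (approximately) e.
        for (int i = 0; i < 64; ++i)
            std::cout << bit(gen);
        std::cout << "\n";
    }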

    Ok, suppose you want to sample one byte with a given entropy. It has 256 values, and the entropy is

    H(byte|p_1,...,p_256) = -Sum(i=1..256) p_i log2(p_i).

    Again, if all combinations are equiprobable (p_i = 1/256), you'll get -(256/256) log2(1/256) = 8, which is the maximum entropy. If you now fix your entropy (say, I want it to be 7), there is an infinite number of solutions for the p_i; there is no single unique realization of a given entropy.
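
    As a quick sanity check of that formula, a few lines that evaluate -Sum p_i log2(p_i) over a probability vector print 8 for the uniform 256-value case:

    #include <cmath>
    #include <iostream>
    #include <vector>

    // Shannon entropy in bits: -sum_i p_i * log2(p_i), skipping zero entries.
    double entropy(const std::vector<double>& p) {
        double h = 0.0;
        for (double pi : p)
            if (pi > 0.0) h -= pi * std::log2(pi);
        return h;
    }

    int main() {
        std::vector<double> uniform(256, 1.0 / 256.0);
        std::cout << entropy(uniform) << "\n";   // prints 8
    }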

    You could simplify the problem a bit - let's consider again the one-parameter case, where the probability to find a 1 is p and the probability to find a 0 is (1-p). Thus, from 256 outcomes we are down to 9 distinct cases (one per number of set bits) - 00000000, 00000001, 00000011, 00000111, 00001111, 00011111, 00111111, 01111111, 11111111. For each of those cases we can write the probability, compute the entropy, set it to whatever value you want and solve back to find p.

    Sampling would be relatively easy - the first step is to sample one of the 9 combinations via a discrete distribution, and the second step is to shuffle the bits inside the byte using a Fisher-Yates shuffle.
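
    A minimal sketch of that two-step sampling, assuming each of the 9 cases (k set bits) is weighted by the binomial probability C(8,k) p^k (1-p)^(8-k): std::discrete_distribution picks the case and std::shuffle supplies the uniform bit permutation that the Fisher-Yates step calls for:

    #include <algorithm>
    #include <cmath>
    #include <cstdint>
    #include <iostream>
    #include <random>
    #include <vector>

    int main() {
        double p = 0.2;   // probability of a set bit, obtained by solving the entropy equation
        std::mt19937 gen{std::random_device{}()};

        // Step 1: weights for the 9 cases, k = number of set bits in the byte.
        // Assumed weighting: binomial, C(8,k) * p^k * (1-p)^(8-k).
        const double c8[9] = {1, 8, 28, 56, 70, 56, 28, 8, 1};
        std::vector<double> w(9);
        for (int k = 0; k <= 8; ++k)
            w[k] = c8[k] * std::pow(p, k) * std::pow(1.0 - p, 8 - k);
        std::discrete_distribution<int> pick(w.begin(), w.end());

        for (int n = 0; n < 16; ++n) {
            int k = pick(gen);   // how many 1-bits this byte gets

            // Step 2: place k ones and shuffle the bit positions
            // (std::shuffle yields the uniform permutation a Fisher-Yates pass would).
            std::vector<int> bits(8, 0);
            std::fill(bits.begin(), bits.begin() + k, 1);
            std::shuffle(bits.begin(), bits.end(), gen);

            std::uint8_t byte = 0;
            for (int b = 0; b < 8; ++b)
                byte = static_cast<std::uint8_t>((byte << 1) | bits[b]);
            std::cout << static_cast<int>(byte) << "\n";
        }
    }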

    The same approach can be used for, say, 32-bit or 64-bit values - you have 33 or 65 cases, construct the entropy, set it to whatever you want, find p, sample one of the cases and then shuffle the bits inside the sampled value.

    No full implementation right now beyond the sketches above, but I could probably write more complete code later if there is interest...

    UPDATE

    Keep in mind another peculiar property of fixing the entropy. Even for the simple case of a single bit, if you try to solve

    -p log2(p) - (1 - p) log2(1 - p) = e

    for a given e, you'll get two answers, and it is easy to understand why - the equation is symmetric with respect to p and 1-p (or, equivalently, swapping 0s and 1s). In other words, for the entropy it is irrelevant whether you transfer information using mostly zeros or mostly ones. This is not true for things like natural text.
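
    A quick numerical check of that symmetry: H(p) and H(1-p) evaluate to the same value, so a target entropy e alone cannot tell the two roots apart.

    #include <cmath>
    #include <iostream>

    double binary_entropy(double p) {
        if (p <= 0.0 || p >= 1.0) return 0.0;
        return -p * std::log2(p) - (1.0 - p) * std::log2(1.0 - p);
    }

    int main() {
        double p = 0.11;   // H(0.11) is roughly 0.5 bits
        std::cout << binary_entropy(p) << " == " << binary_entropy(1.0 - p) << "\n";
    }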