Search code examples
algorithmmathrandomrandom-access

random increasing sequence with O(1) access to any element?


I have an interesting math/CS problem. I need to sample from a possibly infinite random sequence of increasing values, X, with X(i) > X(i-1), with some distribution between them. You could think of this as the sum of a different sequence D of uniform random numbers in [0,d). This is easy to do if you start from the first one and go from there; you just add a random amount to the sum each time. But the catch is, I want to be able to get any element of the sequence in faster than O(n) time, ideally O(1), without storing the whole list. To be concrete, let's say I pick d=1, so one possibility for D (given a particular seed) and its associated X is:

D={.1, .5, .2, .9, .3, .3, .6 ...}  // standard random sequence, elements in [0,1)
X={.1, .6, .8, 1.7, 2.0, 2.3, 2.9, ...} // increasing random values; partial sum of D

(I don't really care about D, I'm just showing one conceptual way to construct X, my sequence of interest.) Now I want to be able to compute the value of X[1] or X[1000] or X[1000000] equally fast, without storing all the values of X or D. Can anyone point me to some clever algorithm or a way to think about this?

(Yes, what I'm looking for is random access into a random sequence -- with two different meanings of random. Makes it hard to google for!)


Solution

  • Since D is pseudorandom, there’s a space-time tradeoff possible: O(sqrt(n))-time retrievals using O(sqrt(n)) storage locations (or, in general, O(n**alpha)-time retrievals using O(n**(1-alpha)) storage locations). Assume zero-based indexing and that X[n] = D[0] + D[1] + ... + D[n-1]. Compute and store

    Y[s] = X[s**2]
    

    for all s**2 <= n in the range of interest. To look up X[n], let s = floor(sqrt(n)) and return Y[s] + D[s**2] + D[s**2+1] + ... + D[n-1].

    EDIT: here's the start of an approach based on the following idea.

    Let Dist(1) be the uniform distribution on [0, d) and let Dist(k) for k > 1 be the distribution of the sum of k independent samples from Dist(1). We need fast, deterministic methods to (i) pseudorandomly sample Dist(2**p) and (ii) given that X and Y are distributed as Dist(2**p), pseudorandomly sample X conditioned on the outcome of X + Y.

    Now imagine that the D array constitutes the leaves of a complete binary tree of size 2**q. The values at interior nodes are the sums of the values at their two children. The naive way is to fill the D array directly, but then it takes a long time to compute the root entry. The way I'm proposing is to sample the root from Dist(2**q). Then, sample one child according to Dist(2**(q-1)) given the root's value. This determines the value of the other, since the sum is fixed. Work recursively down the tree. In this way, we look up tree values in time O(q).

    Here's an implementation for Gaussian D. I'm not sure it's working properly.

    import hashlib, math
    
    def random_oracle(seed):
        h = hashlib.sha512()
        h.update(str(seed).encode())
        x = 0.0
        for b in h.digest():
            x = ((x + b) / 256.0)
        return x
    
    def sample_gaussian(variance, seed):
        u0 = random_oracle((2 * seed))
        u1 = random_oracle(((2 * seed) + 1))
        return (math.sqrt((((- 2.0) * variance) * math.log((1.0 - u0)))) * math.cos(((2.0 * math.pi) * u1)))
    
    def sample_children(sum_outcome, sum_variance, seed):
        difference_outcome = sample_gaussian(sum_variance, seed)
        return (((sum_outcome + difference_outcome) / 2.0), ((sum_outcome - difference_outcome) / 2.0))
    
    def sample_X(height, i):
        assert (0 <= i <= (2 ** height))
        total = 0.0
        z = sample_gaussian((2 ** height), 0)
        seed = 1
        for j in range(height, 0, (- 1)):
            (x, y) = sample_children(z, (2 ** j), seed)
            assert (abs(((x + y) - z)) <= 1e-09)
            seed *= 2
            if (i >= (2 ** (j - 1))):
                i -= (2 ** (j - 1))
                total += x
                z = y
                seed += 1
            else:
                z = x
        return total
    
    def test(height):
        X = [sample_X(height, i) for i in range(((2 ** height) + 1))]
        D = [(X[(i + 1)] - X[i]) for i in range((2 ** height))]
        mean = (sum(D) / len(D))
        variance = (sum((((d - mean) ** 2) for d in D)) / (len(D) - 1))
        print(mean, math.sqrt(variance))
        D.sort()
        with open('data', 'w') as f:
            for d in D:
                print(d, file=f)
    if (__name__ == '__main__'):
        test(10)