
Upper bound on predictability


I'm trying to compute the upper bound on the predictability of my occupancy dataset, as in Song's 'Limits of Predictability in Human Mobility' paper. Here, home (=1) and not at home (=0) play the role of the visited locations (towers) in Song's paper.

I tested my code (which I derived from https://github.com/gavin-s-smith/MobilityPredictabilityUpperBounds and https://github.com/gavin-s-smith/EntropyRateEst) on a random binary sequence, which should yield an entropy of 1 and a predictability of 0.5. Instead, the returned entropy is 0.87 and the predictability 0.71.

Here's my code:

import numpy as np
from scipy.optimize import fsolve
from cmath import log 
import math

def matchfinder(data):
    data_len = len(data)    
    output = np.zeros(len(data))
    output[0] = 1

    # Using L_{n} definition from
    #"Nonparametric Entropy Estimation for Stationary Process and Random Fields, with Applications to English Text"
    # by Kontoyiannis et. al.
    # $L_{n} = 1 + \max \{ l : 0 \leq l \leq n, X^{l-1}_{0} = X^{-j+l-1}_{-j} \text{ for some } l \leq j \leq n \}$

    # for each position, i, in the sub-sequence that occurs before the current position, start_idx
    # check to see the maximum continuously equal string we can make by simultaneously extending from i and start_idx

    for start_idx in range(1,data_len):
        max_subsequence_matched = 0
        for i in range(0,start_idx):
            j = 0

            # increase the length of the match starting at i and start_idx
            # while the elements are equal, keeping track of the length
            while( (start_idx+j < data_len) and (i+j < start_idx) and (data[i+j] == data[start_idx+j]) ):
                j = j + 1

            if j > max_subsequence_matched:
                max_subsequence_matched = j

        # L_{n} is obtained by adding 1 to the longest match-length
        output[start_idx] = max_subsequence_matched + 1

    return output

if __name__ == '__main__':
    #Test with a random binary sequence
    data = np.random.randint(2,size=2000)

    #Number of distinct locations
    N = len(np.unique(data))

    #True entropy
    lambdai = matchfinder(data)
    Etrue = math.pow(sum( [ lambdai[i] / math.log(i+1,2) for i in range(1,len(data))] ) * (1.0/len(data)),-1)

    S = Etrue
    #use Fano's inequality to compute the predictability
    func = lambda x: (-(x*log(x,2).real+(1-x)*log(1-x,2).real)+(1-x)*log(N-1,2).real ) - S 
    ub = fsolve(func, 0.9)[0]
    print(ub)

The matchfinder function estimates the entropy by looking for the longest previously seen match and adding 1 to its length (i.e., the length of the shortest substring not previously seen). The predictability is then computed using Fano's inequality.
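
For reference, the equation being solved is Fano's inequality with equality, as in Song et al.: $S = H(\Pi^{max}) + (1 - \Pi^{max}) \log_2(N - 1)$, where $H(\Pi^{max}) = -\Pi^{max} \log_2 \Pi^{max} - (1 - \Pi^{max}) \log_2(1 - \Pi^{max})$ is the binary entropy and N is the number of distinct locations. The func lambda above encodes exactly this expression minus S, and fsolve finds its root, the predictability bound $\Pi^{max}$.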

What could be the problem?

Thanks!


Solution

  • The entropy function seems to be wrong. Referring to the paper you mentioned (Song, C., Qu, Z., Blumm, N., & Barabási, A. L. (2010). Limits of predictability in human mobility. Science, 327(5968), 1018–1021), the real entropy is estimated by an algorithm based on Lempel-Ziv data compression:

    $S^{est} = \left( \frac{1}{n} \sum_{i} \Lambda_{i} \right)^{-1} \ln n$

    In code it would look like this:

    Etrue = math.pow((np.sum(lambdai)/ n),-1)*log(n,2).real
    

    where n is the length of the time series.

    Notice that we use a different logarithm base than in the given formula. However, since the logarithm base in Fano's inequality is 2, it seems logical to use the same base for the entropy calculation. Also, I'm not sure why you started the sum from index 1 instead of index 0.
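
    To make the base change concrete, here is a minimal sanity check (the length 10000 is just an arbitrary example):

    import math

    n = 10000.0
    S_nats = math.log(n)     # natural log, as in the paper's formula
    S_bits = math.log(n, 2)  # base 2, as used in the code above
    # the two differ exactly by the constant factor ln 2
    assert abs(S_bits - S_nats / math.log(2)) < 1e-12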

    Wrapping that up into a function, for example:

    def solve(locations, size):
        data = np.random.randint(locations,size=size)
        N = len(np.unique(data))
        n = float(len(data))
        print "Distinct locations: %i" % N
        print "Time series length: %i" % n
    
        #True entropy
        lambdai = matchfinder(data)
        #S = math.pow(sum([lambdai[i] / math.log(i + 1, 2) for i in range(1, len(data))]) * (1.0 / len(data)), -1)
        Etrue = math.pow((np.sum(lambdai)/ n),-1)*log(n,2).real
        S = Etrue
        print "Maximum entropy: %2.5f" % log(locations,2).real
        print "Real entropy: %2.5f" % S
    
        func = lambda x: (-(x * log(x, 2).real + (1 - x) * log(1 - x, 2).real) + (1 - x) * log(N - 1, 2).real) - S
        ub = fsolve(func, 0.9)[0]
        print "Upper bound of predictability: %2.5f" % ub
        return ub
    
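
    For example, the outputs below come from calls like the following (exact numbers vary from run to run, since the input is random):

    solve(2, 10000)
    solve(3, 10000)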

    Output for 2 locations

    Distinct locations: 2
    Time series length: 10000
    Maximum entropy: 1.00000
    Real entropy: 1.01441
    Upper bound of predictability: 0.50013
    

    Output for 3 locations

    Distinct locations: 3
    Time series length: 10000
    Maximum entropy: 1.58496
    Real entropy: 1.56567
    Upper bound of predictability: 0.41172
    

    The Lempel-Ziv estimator only converges to the real entropy as n approaches infinity, which is why in the 2-locations case it comes out slightly above the theoretical maximum.
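
    To watch that convergence, you can run solve for increasing lengths (a minimal sketch; the sizes are arbitrary, and the matching step is quadratic, so long series take a while):

    for size in (1000, 5000, 20000):
        solve(2, size)  # the reported entropy drifts toward 1.0 as size grows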

    I am also not sure whether you interpreted the definition of lambda correctly. It is defined as "the length of the shortest substring starting at position i which doesn't previously appear from position 1 to i-1". So once we reach a point from which the remaining substrings are no longer unique, your matching algorithm still assigns a length one higher than the match length, while it should instead be 0, since no unique substring exists there.

    To make it clearer let's just give a simple example. If the array of positions looks like that:

    [1 0 0 1 0 0]
    

    Then we can see that after the first three positions the pattern is repeated. That means that from the fourth position onward the shortest unique substring does not exist, so it equals 0. The output (lambda) should therefore look like this:

    [1 1 2 0 0 0]
    

    However, your function for that case would return:

    [1 1 2 4 3 2]
    

    I rewrote your matching function to handle that problem:

    def matchfinder2(data):
        data_len = len(data)
        output = np.zeros(len(data))
        output[0] = 1
        for start_idx in range(1,data_len):
            max_subsequence_matched = 0
            for i in range(0,start_idx):
                j = 0
                end_distance = data_len - start_idx #length left to the end of sequence (including current index)
                while( (start_idx+j < data_len) and (i+j < start_idx) and (data[i+j] == data[start_idx+j]) ):
                    j = j + 1
                if j == end_distance: #check if j has reached the end of sequence
                    output[start_idx::] = np.zeros(end_distance) #if yes fill the rest of output with zeros
                    return output #end function
                elif j > max_subsequence_matched:
                    max_subsequence_matched = j
            output[start_idx] = max_subsequence_matched + 1
        return output
    

    The differences are small, of course, because the result changes only for a small part of the sequence.
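
    A quick check against the example above (a minimal sketch) confirms the corrected behaviour:

    print(matchfinder2(np.array([1, 0, 0, 1, 0, 0])))  # [1. 1. 2. 0. 0. 0.]
    print(matchfinder(np.array([1, 0, 0, 1, 0, 0])))   # [1. 1. 2. 4. 3. 2.]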