python multithreading python-multiprocessing random-walk

python library for evaluating random walks?

I'm trying to evaluate the probabilities of end locations of random walks but I'm having some trouble with the speed of my program. Basically what I'm trying to do is take as an input a dictionary that contains the probabilities for a random walk( e.g. p = {0:0.5, 1:0.2. -1:0.3} meaning there's a 50% probability X stays at 0, a 20% probability X increases by 1, and a 30% probability X decreases by 1) and then calculate the probabilities for all the possible future states after n iterations.

So for example if p = {0:0.5, 1:0.2. -1:0.3} and n = 2 then it will return {0:0.37, 1:0.2, -1:0.3, 2:0.04, -2:0.09} if p = {0:0.5, 1:0.2. -1:0.3} and n = 1 then it will return {0:0.5, 1:0.2. -1:0.3}

I have working code, and it runs relatively quickly if n is low and if the p dictionary is small, but when n > 500 and the dictionary has around 50 values it takes upwards of 5 minutes to calculate. I'm guessing this is because it does it only on one processor so I went ahead and modified it so it would use python's multiprocessing module (as I read that multithreading doesn't improve parallel computing performance because of GIL).

My problem is, that there is not much improvement with multiprocessing, now I'm not sure if it's because I'm implementing it wrong or because of the overhead of multiprocessing in python. I'm just wondering if there's a library somewhere that evaluates all the probabilities of all the possibilities of a random walk when n > 500 in parallel? My next step if I can't find anything is to write my own function as an extension in C but it will be my first time doing it and although I've coded in C before it has been a while.

Original Non MultiProcessed Code

def random_walk_predictor(probabilities_tree, period):
    ret = probabilities_tree
    probabilities_leaves = ret.copy()
    for x in range(period):
        tmp = {}
        for leaf in ret.keys():
            for tree_leaf in probabilities_leaves.keys():
                try:
                    tmp[leaf + tree_leaf] = (ret[leaf] * probabilities_leaves[tree_leaf]) + tmp[leaf + tree_leaf]
                except:
                    tmp[leaf + tree_leaf] = ret[leaf] * probabilities_leaves[tree_leaf]
        ret = tmp
    return ret

MultiProcessed code

from multiprocessing import Manager,Pool
from functools import partial

def probability_calculator(origin, probability, outp, reference):
    for leaf in probability.keys():
        try:
            outp[origin + leaf] = outp[origin + leaf] + (reference[origin] * probability[leaf])
        except KeyError:
            outp[origin + leaf] = reference[origin] * probability[leaf]

def random_walk_predictor(probabilities_leaves, period):
    probabilities_leaves = tree_developer(probabilities_leaves)
    manager = Manager()
    prob_leaves = manager.dict(probabilities_leaves)
    ret = manager.dict({0:1})
    p = Pool()

    for x in range(period):
        out = manager.dict()
        partial_probability_calculator = partial(probability_calculator, probability = prob_leaves, outp = out, reference = ret.copy())

        p.map(partial_probability_calculator, ret.keys())
        ret = out

    return ret.copy()

Solution

There tend to be analytic solutions to exactly solve this kind of problem that look similar to binomial distributions, but I'll assume you're really asking for a computational solution for a more general class of problem.

Rather than using python dictionaries, it's easier to think about this in terms of the underlying mathematical problem. Build a matrix A that describes the probability of going from one state to another. Build a state x that describes the probability of being at a given location at some time.

Because after n transitions you can step at most n steps from the origin (in either direction) - your state needs to have 2n+1 rows, and A needs to be square and of size 2n+1 by 2n+1.

For a two timestep problem your transition matrix will be 5x5 and look like:

[[ 0.5  0.2  0.   0.   0. ]
 [ 0.3  0.5  0.2  0.   0. ]
 [ 0.   0.3  0.5  0.2  0. ]
 [ 0.   0.   0.3  0.5  0.2]
 [ 0.   0.   0.   0.3  0.5]]

And your state at time 0 will be:

[[ 0.]
 [ 0.]
 [ 1.]
 [ 0.]
 [ 0.]]

The one step evolution of the system can be predicted by multiplying A and x.

So at t = 1,

 x.T = [[ 0.   0.2  0.5  0.3  0. ]]

and at t = 2,

x.T = [[ 0.04  0.2   0.37  0.3   0.09]]

Because for even modest numbers of timesteps this is potentially going to take a fair bit of storage (A requires n^2 storage), but is very sparse, we can use sparse matrices to reduce our storage (and speed up our calculations). Doing this means A requires approximate 3n elements.

import scipy.sparse as sp
import numpy as np

def random_walk_transition_probability(n, left = 0.3, centre = 0.5, right = 0.2):
    m = 2*n+1
    A  = sp.csr_matrix((m, m))
    A += sp.diags(centre*np.ones(m), 0)
    A += sp.diags(left*np.ones(m-1), -1)
    A += sp.diags(right*np.ones(m-1),  1)
    x = np.zeros((m,1))
    x[n] = 1.0
    for i in xrange(n):
        x = A.dot(x)
    return x

print random_walk_transition_probability(4)

Timings

%timeit random_walk_transition_probability(500)
100 loops, best of 3: 7.12 ms per loop

%timeit random_walk_transition_probability(10000)
1 loops, best of 3: 1.06 s per loop