python numpy optimization vectorization multinomial

Fast vectorized multinomial in python

I am currently using NumPy for the following task: I have a large grid of values and I need to take a multinomial sample at each point. The probability vector for the multinomial will vary from grid site to grid site, so the NumPy multinomial function doesn't quite work for me since it does all of its draws from the same distribution. Iterating over each site seems incredibly inefficient and I was wondering if there was a way to do this efficiently in NumPy. It appears that something like this is possible (and fast) if I use Theano (see this answer), but that would require quite a bit of rewriting, which I would ideally like to avoid. Can multinomial sampling be vectorized efficiently in basic NumPy?

Edit: It is easy to modify Warren's code to allow different counts at different sites, as I have found that I need: all one needs to do is pass in the full count array and remove the first like, like so:

import numpy as np


def multinomial_rvs(count, p):
    """
    Sample from the multinomial distribution with multiple p vectors.

    * count must be an (n-1)-dimensional numpy array.
    * p must an n-dimensional numpy array, n >= 1.  The last axis of p
      holds the sequence of probabilities for a multinomial distribution.

    The return value has the same shape as p.
    """
    out = np.zeros(p.shape, dtype=int)
    ps = p.cumsum(axis=-1)
    # Conditional probabilities
    with np.errstate(divide='ignore', invalid='ignore'):
        condp = p / ps
    condp[np.isnan(condp)] = 0.0
    for i in range(p.shape[-1]-1, 0, -1):
        binsample = np.random.binomial(count, condp[..., i])
        out[..., i] = binsample
        count -= binsample
    out[..., 0] = count
    return out

Solution

Here's one way you can do it. It is not fully vectorized, but the Python loop is over the p values. If the length of your p vector(s) is not too large, this might be sufficiently fast for you.

The multinomial distribution is implemented using repeated calls to np.random.binomial, which implements broadcasting of its arguments.

import numpy as np


def multinomial_rvs(n, p):
    """
    Sample from the multinomial distribution with multiple p vectors.

    * n must be a scalar.
    * p must an n-dimensional numpy array, n >= 1.  The last axis of p
      holds the sequence of probabilities for a multinomial distribution.

    The return value has the same shape as p.
    """
    count = np.full(p.shape[:-1], n)
    out = np.zeros(p.shape, dtype=int)
    ps = p.cumsum(axis=-1)
    # Conditional probabilities
    with np.errstate(divide='ignore', invalid='ignore'):
        condp = p / ps
    condp[np.isnan(condp)] = 0.0
    for i in range(p.shape[-1]-1, 0, -1):
        binsample = np.random.binomial(count, condp[..., i])
        out[..., i] = binsample
        count -= binsample
    out[..., 0] = count
    return out

Here's an example where the "grid" has shape (2, 3), and the multinomial distribution is four dimensional (i.e. each p vector has length 4).

In [182]: p = np.array([[[0.25, 0.25, 0.25, 0.25], 
     ...:                [0.01, 0.02, 0.03, 0.94], 
     ...:                [0.75, 0.15, 0.05, 0.05]], 
     ...:               [[0.01, 0.99, 0.00, 0.00], 
     ...:                [1.00, 0.00, 0.00, 0.00], 
     ...:                [0.00, 0.25, 0.25, 0.50]]])                                                                                                 

In [183]: sample = multinomial_rvs(1000, p)                                     

In [184]: sample                                                                
Out[184]: 
array([[[ 249,  260,  233,  258],
        [   3,   21,   33,  943],
        [ 766,  131,   55,   48]],

       [[   5,  995,    0,    0],
        [1000,    0,    0,    0],
        [   0,  273,  243,  484]]])

In [185]: sample.sum(axis=-1)                                                   
Out[185]: 
array([[1000, 1000, 1000],
       [1000, 1000, 1000]])

In a comment, you said "The p vector is of the form: p = [p_s, (1-p_s)/4, (1-p_s)/4, (1-p_s)/4, (1-p_s)/4], with p_s varying from site to site." Here's how you can use the above function, given an array that contains the p_s values.

First create some data for the example:

In [73]: p_s = np.random.beta(4, 2, size=(2, 3))                                                                                                        

In [74]: p_s                                                                                                                                            
Out[74]: 
array([[0.61662208, 0.6072323 , 0.62208711],
       [0.86848938, 0.58959038, 0.47565799]])

Create the array p containing the multinomial probabilities according to the formula p = [p_s, (1-p_s)/4, (1-p_s)/4, (1-p_s)/4, (1-p_s)/4]:

In [75]: p = np.expand_dims(p_s, -1) * np.array([1, -0.25, -0.25, -0.25, -0.25]) + np.array([0, 0.25, 0.25, 0.25, 0.25])                                

In [76]: p                                                                                                                                              
Out[76]: 
array([[[0.61662208, 0.09584448, 0.09584448, 0.09584448, 0.09584448],
        [0.6072323 , 0.09819192, 0.09819192, 0.09819192, 0.09819192],
        [0.62208711, 0.09447822, 0.09447822, 0.09447822, 0.09447822]],

       [[0.86848938, 0.03287765, 0.03287765, 0.03287765, 0.03287765],
        [0.58959038, 0.1026024 , 0.1026024 , 0.1026024 , 0.1026024 ],
        [0.47565799, 0.1310855 , 0.1310855 , 0.1310855 , 0.1310855 ]]])

Now do the same as before to generate the sample (change the value 1000 to whatever is appropriate to your problem):

In [77]: sample = multinomial_rvs(1000, p)                                                                                                              

In [78]: sample                                                                                                                                         
Out[78]: 
array([[[618,  92, 112,  88,  90],
        [597, 104, 103, 101,  95],
        [605, 100,  95,  98, 102]],

       [[863,  32,  43,  27,  35],
        [602, 107, 108,  94,  89],
        [489, 130, 129, 129, 123]]])

In [79]: sample.sum(axis=-1)                                                                                                                            
Out[79]: 
array([[1000, 1000, 1000],
       [1000, 1000, 1000]])