I have a 3D numpy array with the probabilities of each category in the last dimension. Something like:
import numpy as np
from scipy.special import softmax
array = np.random.normal(size=(10, 100, 5))
probabilities = softmax(array, axis=2)
How can I sample from a categorical distribution with those probabilities?
EDIT: Right now I'm doing it like this:
def categorical(x):
    return np.random.multinomial(1, pvals=x)
samples = np.apply_along_axis(categorical, axis=2, arr=probabilities)
But it's very slow so I want to know if there's a way to vectorize this operation.
Drawing samples from a given probability distribution is done by evaluating the inverse cumulative distribution function at a random number in the range 0 to 1. For a small number of discrete categories - like in the question - you can find the inverse using a linear search:
## Alternative test dataset
probabilities[:, :, :] = np.array([0.1, 0.5, 0.15, 0.15, 0.1])
n1, n2, m = probabilities.shape
cum_prob = np.cumsum(probabilities, axis=-1) # shape (n1, n2, m)
r = np.random.uniform(size=(n1, n2, 1))
# argmax finds the index of the first True value in the last axis.
samples = np.argmax(cum_prob > r, axis=-1)
print('Statistics:')
print(np.histogram(samples, bins=np.arange(m+1)-0.5)[0]/(n1*n2))
For the test dataset, a typical test output was:
Statistics:
[0.0998 0.4967 0.1513 0.1498 0.1024]
which looks OK.
If you have many, many categories (thousands), it's probably better to do a bisection search using a numba-compiled function.
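As a rough illustration of that idea, here is a minimal sketch of a batched binary search with numba. The function name categorical_bisect, the flattening of the first two axes, and the reuse of cum_prob, n1, n2 and m from the code above are my own choices for this example, not part of the original answer; it also assumes numba is installed and that each row of cum_prob ends at (approximately) 1 so the search always lands on a valid index.

import numba

@numba.njit
def categorical_bisect(cum_prob_2d, r):
    # cum_prob_2d: (n, m) cumulative probabilities, one row per sample
    # r: (n,) uniform random numbers in [0, 1)
    n, m = cum_prob_2d.shape
    out = np.empty(n, dtype=np.int64)
    for i in range(n):
        # Binary search for the first index j with cum_prob_2d[i, j] > r[i]
        lo, hi = 0, m - 1
        while lo < hi:
            mid = (lo + hi) // 2
            if cum_prob_2d[i, mid] > r[i]:
                hi = mid
            else:
                lo = mid + 1
        out[i] = lo
    return out

# Flatten the leading axes, sample, then restore the original shape
cum_flat = cum_prob.reshape(-1, m)
r_flat = np.random.uniform(size=cum_flat.shape[0])
samples_bisect = categorical_bisect(cum_flat, r_flat).reshape(n1, n2)

With thousands of categories this does O(log m) comparisons per sample instead of the O(m) comparisons hidden in the cum_prob > r mask, which is where the speedup would come from.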