Search code examples
pythonnumpyprobabilitybernoulli-probability

Binomial distributions (Bernoulli trials) with different probabilities


I want to speed up the code below - namely the for loop. Is there a way to do it in numpy?

import numpy as np
# define seend and random state
rng = np.random.default_rng(0)
# num of throws
N = 10**1
# max number of trials
total_n_trials = 10
# generate the throws' distributions of "performace" - overall scores
# mu_throws = rng.normal(0.7, 0.1, N)
mu_throws = np.linspace(0,1,N)

all_trials = np.zeros(N*total_n_trials)
for i in range(N):
    # generate trials per throw as Bernoulli trials with given mean
    all_trials[i*total_n_trials:(i+1)*total_n_trials] = rng.choice([0,1], size=total_n_trials, p=[1-mu_throws[i],mu_throws[i]])
    

More explanations - I want to generate N sequences of Bernoulli trials (ie. 0s and 1s, called throws) where each sequence has a mean (probability p) given by values in another array (mu_throws). This can be sampled from a normal distribution or in this case, I took it for simplicity to be a sequence of N=10 numbers from 0 to 1. The approach above works but is slow. I expect N to be at least $10^4$ and total_n_trials then can be anything in the order of hundreds to (tens of) thousands (and this is run several times). I have checked the following post but didn't find an answer. I also know that numpy random choice can generate multidimensional arrays but I didn't find a way to set a different set of ps for different rows. Basically getting the same as what I do above just reshaped:

all_trials.reshape(N,-1)
array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 1., 1., 0., 0.],
       [1., 0., 0., 1., 0., 0., 0., 1., 1., 0.],
       [1., 0., 1., 0., 0., 1., 0., 1., 0., 1.],
       [1., 0., 1., 0., 0., 0., 1., 1., 0., 0.],
       [1., 0., 0., 1., 0., 1., 0., 1., 1., 0.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]])

I also thought about this trick but haven't figured how to use it for Bernoulli trials. Thanks for any tips.


Solution

  • N = 11
    mu_throws = np.linspace(0,1,N)
    total_n_trials = 10_000
    
    rng = np.random.default_rng(0)
    all_trials = (rng.random((N, total_n_trials)).T<mu_throws).T.astype(int)
    all_trials # shape (N, total_n_trials)
    

    output:

    array([[0, 0, 0, ..., 0, 0, 0],
           [0, 0, 0, ..., 0, 0, 0],
           [1, 0, 0, ..., 0, 0, 0],
           ...,
           [1, 0, 1, ..., 1, 1, 1],
           [1, 1, 1, ..., 0, 1, 1],
           [1, 1, 1, ..., 1, 1, 1]])
    

    Basically, what I'm doing is generate random real numbers in the interval [0, 1) and convert them in boolean results in function of a given probability (in mu_throws).

    if you compare the mu_throws (actual probability mean) and the probability estimated in all_trials you have:

    np.c_[mu_throws, all_trials.mean(1)]
    

    output:

    array([[0.    , 0.    ],
           [0.1   , 0.1003],
           [0.2   , 0.1963],
           [0.3   , 0.305 ],
           [0.4   , 0.4006],
           [0.5   , 0.5056],
           [0.6   , 0.5992],
           [0.7   , 0.6962],
           [0.8   , 0.7906],
           [0.9   , 0.8953],
           [1.    , 1.    ]])
    

    For N and total_n_trials values from your example the time required on my machine is 0.00014019012451171875 sec, vs 0.0012738704681396484 sec of your loop.