I want to select items from a list according to a per-element probability. The length of the output is not known in advance, and the selection needs to be repeated many times.
So, say I have a list of [id, probability] pairs like [[1, 0.2], [2, 0.3], [3, 0.45], [4, 0.05], [5, 1.]]. I want to get something like [[1, 3, 5], [5], [3, 5], [5], [1, 2, 4, 5], ...], where each inner list is one independent selection.
Here's my code to do so. It works, but it's pretty slow: the list is long (more than 10,000 [id, probability] pairs) and the result is thousands of selections long. Do you know of any way to make this (much) faster?
import numpy as np

items = [[1, 0.2], [2, 0.3], [3, 0.45], [4, 0.05], [5, 1.]]

combinations = []
for n in range(1000):
    selection = []
    for i in items:
        # draw True with probability i[1], False otherwise
        chosen = np.random.choice([True, False], p=[i[1], 1. - i[1]])
        if chosen:
            selection.append(i[0])
    combinations.append(selection)
You can vectorize the sampling step as follows:
import numpy as np

# items = [[1, 0.2], [2, 0.3], [3, 0.45], [4, 0.05], [5, 1.]]
items = [(i, np.random.rand()) for i in range(1000)]

def sample_original(itms, n=1000):
    combinations = []
    for _ in range(n):
        selection = []
        for i in itms:
            chosen = np.random.choice([True, False], p=[i[1], 1. - i[1]])
            if chosen:
                selection.append(i[0])
        combinations.append(selection)
    return combinations

def sample_numpy(itms, n=1000):
    # one vector of uniform draws per selection; keep the elements
    # whose draw falls below their probability
    elts, probs = np.array(itms).T
    m = len(elts)
    return [elts[np.random.rand(m) < probs] for _ in range(n)]
The main observation is that np.random.rand(m) < probs gives a boolean mask that selects each element of the original list with the correct probability. This seems to be about 1000x faster when there are 1000 items:
%timeit sample_numpy(items)
10 loops, best of 3: 22.6 ms per loop

%timeit sample_original(items)
1 loop, best of 3: 21.9 s per loop
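For the small example list from your question, usage looks like this (a sketch; the exact selections vary with the random state, and note that np.array(itms).T upcasts everything to float, so the ids come back as floats and need a cast if you want integers):

items = [[1, 0.2], [2, 0.3], [3, 0.45], [4, 0.05], [5, 1.]]
for sel in sample_numpy(items, n=5):
    # ids are floats here because np.array(itms).T is a float array
    print(sel.astype(int))
# possible output (varies by run; 5 always appears since its probability is 1.0):
# [3 5]
# [1 5]
# [5]
# [2 3 5]
# [1 3 4 5]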
If the probabilities are relatively high (resulting selections are not sparse), you might want to store the selection indicators in a 2D array for further performance gains.
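A minimal sketch of that idea (sample_numpy_2d is just an illustrative name): draw all the uniform variates in a single call and keep the (n, m) indicator matrix itself, instead of materializing a Python list of index arrays per row:

def sample_numpy_2d(itms, n=1000):
    # one (n, m) matrix of uniform draws, compared against probs;
    # probs broadcasts across rows, so row k is the True/False
    # indicator vector for selection k
    elts, probs = np.array(itms).T
    mask = np.random.rand(n, len(elts)) < probs
    return elts, mask

elts, mask = sample_numpy_2d(items)
# recover explicit id lists only where needed, e.g. for the first selection:
first_selection = elts[mask[0]]

This avoids the per-selection Python loop entirely, at the cost of holding the full boolean matrix in memory, which is only worthwhile when the selections are dense enough that the indicator representation isn't wasteful.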