I have a 1D array A that represents categorical data (each entry is the number of elements in a certain category):
A = array([ 1, 8, 2, 5, 10, 32, 0, 0, 1, 0])
and I am trying to write a function sample(A, N) to generate an array B that contains N elements generated by randomly drawing elements from A (keeping the categories):
>>> sample(A, 20)
array([ 1, 3, 0, 1, 4, 11, 0, 0, 0, 0])
I wrote this:
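from numpy import zeros, random  # NumPy's random module, not the stdlib one

def sample(A, N):
    AA = A.astype(float).copy()  # remaining count per category
    Z = zeros(A.shape)           # counts drawn so far
    for _ in xrange(N):
        # draw one element, weighted by what is still left
        drawn = random.multinomial(1, AA/AA.sum())
        Z = Z + drawn
        AA = AA - drawn          # remove it: sampling without replacement
    return Z.astype(int)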
This is probably quite naive. Is there a better/faster way to do it, perhaps using some fast NumPy function? Edit: To be clear, the sampling has to be without replacement.
This is faster than the others, as far as I can see, but it probably uses more memory, since it builds a list with one entry per element (sum(A) entries in total).
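import random
from collections import Counter

def sample2(A,N):
    # expand the counts into one category index per element
    distribution = [i for i, j in enumerate(A) for _ in xrange(j)]
    # draw N elements without replacement and count them per category
    sample = Counter(random.sample(distribution, N))
    return [sample[i] for i in xrange(len(A))]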
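The same expand-then-sample idea can probably be written with NumPy alone. The following is only a sketch for Python 3, assuming a NumPy version that provides `np.random.default_rng`; the name `sample_counts` and the optional `rng` argument are made up for illustration, and it is not included in the timings.

```python
import numpy as np

def sample_counts(A, N, rng=None):
    """Draw N elements without replacement; return counts per category."""
    A = np.asarray(A)
    rng = np.random.default_rng() if rng is None else rng
    # One entry per element, labelled with its category index
    # (this is the part that costs sum(A) entries of memory).
    labels = np.repeat(np.arange(A.size), A)
    # Pick N distinct elements, then count how many fell in each category.
    picked = rng.choice(labels, size=N, replace=False)
    return np.bincount(picked, minlength=A.size)

A = np.array([1, 8, 2, 5, 10, 32, 0, 0, 1, 0])
B = sample_counts(A, 20)
```

Like `sample2`, this raises an error if N is larger than the total number of elements.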
In [52]: A = np.random.randint(0, 100, 500)
In [53]: %timeit sample(A, 100) #Original
100 loops, best of 3: 2.71 ms per loop
In [54]: %timeit sample2(A, 100) #my function
1000 loops, best of 3: 914 µs per loop
In [55]: %timeit sample3(A, 100) #sftd function
100 loops, best of 3: 8.33 ms per loop
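For completeness: drawing N elements without replacement from counted categories is a multivariate hypergeometric draw. Assuming NumPy 1.18 or newer and Python 3, `Generator.multivariate_hypergeometric` should do this directly without expanding the counts. This is a sketch only and was not part of the timings above.

```python
import numpy as np

rng = np.random.default_rng()
A = np.array([1, 8, 2, 5, 10, 32, 0, 0, 1, 0])

# colors = number of elements per category, nsample = how many to draw.
# The result has one count per category and sums to nsample.
B = rng.multivariate_hypergeometric(A, 20)
```

Each entry of B is at most the corresponding entry of A, so empty categories stay at zero.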