Tags: python, numpy, statistics, scipy, sampling

Down-sampling with numpy


I have a 1D array A that represents categorical data (each entry is the number of elements in a certain category):

A = np.array([ 1, 8, 2, 5, 10, 32, 0, 0, 1, 0])

and I am trying to write a function sample(A, N) that generates an array B of N elements drawn at random from A (keeping the categories):

>>> sample(A, 20)
array([ 1, 3, 0, 1, 4, 11, 0, 0, 0, 0])

I wrote this:

import numpy as np

def sample(A, N):
    AA = A.astype(float).copy()
    Z = np.zeros(A.shape)
    for _ in range(N):
        # draw a single element according to the remaining counts,
        # then remove it from the pool so the draw is without replacement
        drawn = np.random.multinomial(1, AA / AA.sum())
        Z = Z + drawn
        AA = AA - drawn
    return Z.astype(int)

This is probably quite naive; is there a better/faster way to do it, maybe using some fast numpy function? Edit: to be clear, the sampling has to be without replacement!
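
For reference, drawing N items without replacement from these category counts is exactly a multivariate hypergeometric draw, and newer NumPy exposes that directly. A minimal sketch, assuming NumPy >= 1.18 for the Generator API (not part of the original question):

import numpy as np

# Assumption: NumPy >= 1.18, which provides Generator.multivariate_hypergeometric.
rng = np.random.default_rng()
A = np.array([1, 8, 2, 5, 10, 32, 0, 0, 1, 0])
# One call draws 20 items without replacement; the result has the same
# length as A and its entries sum to 20.
B = rng.multivariate_hypergeometric(A, 20)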


Solution

  • Faster than the others as far as I can see, but it probably uses more memory.

    import random
    from collections import Counter

    def sample2(A, N):
        # expand each category index i into A[i] copies, draw N of them
        # without replacement, then count how many fall into each category
        distribution = [i for i, j in enumerate(A) for _ in range(j)]
        sample = Counter(random.sample(distribution, N))
        return [sample[i] for i in range(len(A))]
    
    
    In [52]: A = np.random.randint(0, 100, 500)
    
    In [53]: %timeit sample(A, 100) #Original
    100 loops, best of 3: 2.71 ms per loop
    
    In [54]: %timeit sample2(A, 100) #my function
    1000 loops, best of 3: 914 µs per loop
    
    In [55]: %timeit sample3(A, 100) #sftd function
    100 loops, best of 3: 8.33 ms per loop
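
If staying inside NumPy matters (as the original question asks), the same expand-and-count idea can be sketched with np.repeat, np.random.choice(..., replace=False) and np.bincount. This is an assumption-level sketch, not one of the timed functions above, and sample_np is a made-up name:

    import numpy as np

    def sample_np(A, N):
        # expand category indices by their counts: [1, 3, 0] -> [0, 1, 1, 1]
        pool = np.repeat(np.arange(len(A)), A)
        # draw N distinct elements from the pool, i.e. without replacement
        drawn = np.random.choice(pool, size=N, replace=False)
        # count how many of the drawn elements fall into each category
        return np.bincount(drawn, minlength=len(A))

It is called like the others, e.g. sample_np(A, 100); whether it beats sample2 depends on the sizes involved and would need its own %timeit run.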