Tags: python, python-3.x, numpy, random, probability

Downsample numpy array while preserving distribution


I'm trying to write a function that randomly samples a numpy.ndarray of floating-point numbers while preserving the distribution of the values in the array. This is the function I have so far:

import random
from collections import Counter

def sample(A, N):
    population = np.zeros(sum(A))
    counter = 0
    for i, x in enumerate(A):
            for j in range(x):
                    population[counter] = i
                    counter += 1

    sampling = population[np.random.choice(0, len(population), N)]
    return np.histogram(sampling, bins = np.arange(len(A)+1))[0]

I would like the function to work something like this (this example does not account for the distribution):

a = np.array([1.94, 5.68, 2.77, 7.39, 2.51])
new_a = sample(a,3)

new_a
array([1.94, 2.77, 7.39])

However, when I apply the function to an array like this, I get:

TypeError                                 Traceback (most recent call last)
<ipython-input-74-07e3aa976da4> in <module>
----> 1 sample(a, 3)

<ipython-input-63-2d69398e2a22> in sample(A, N)
      3 
      4 def sample(A, N):
----> 5     population = np.zeros(sum(A))
      6     counter = 0
      7     for i, x in enumerate(A):

TypeError: 'numpy.float64' object cannot be interpreted as an integer

Any help modifying this function, or creating one that works for this case, would be really appreciated!


Solution

  • In [67]: a = np.array([1.94, 5.68, 2.77, 7.39, 2.51])                                                  
    In [68]: np.zeros(sum(a))                                                                              
    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    <ipython-input-68-263779bc977b> in <module>
    ----> 1 np.zeros(sum(a))
    
    TypeError: 'numpy.float64' object cannot be interpreted as an integer
    

    Calling sum on the shape does not produce this error, because the shape contains only integers:

    In [69]: np.zeros(sum(a.shape))                                                                        
    Out[69]: array([0., 0., 0., 0., 0.])
    

    But you shouldn't need sum at all, since np.zeros accepts a shape tuple directly:

    In [70]: a.shape                                                                                       
    Out[70]: (5,)
    In [71]: np.zeros(a.shape)                                                                             
    Out[71]: array([0., 0., 0., 0., 0.])
    

    In fact, if a is 2-D and you want a 1-D array with the same number of items, you want the product of the shape, not the sum.

    But do you want to return an array exactly the same size as A? I thought you were trying to downsize.
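
If the goal is indeed to downsize, one possible approach (not necessarily what the asker intended) is to draw N elements uniformly at random without replacement. Because every element is equally likely to be kept, the sample follows the original distribution in expectation, though a small sample can still differ noticeably. The sketch below assumes that interpretation. The name downsample and its seed parameter are hypothetical and do not appear in the question. The input is flattened first, so the number of candidates is a.size, which is the product of the shape mentioned above.

```python
import numpy as np

def downsample(a, n, seed=None):
    """Return n values drawn uniformly at random, without replacement, from a.

    Hypothetical helper: assumes "preserving the distribution" means a plain
    uniform subsample of the existing values.
    """
    rng = np.random.default_rng(seed)
    flat = np.asarray(a).ravel()  # flat.size equals the product of a.shape
    if n > flat.size:
        raise ValueError("n cannot exceed the number of elements in a")
    # Pick n distinct positions, then sort them so the original order is kept.
    idx = np.sort(rng.choice(flat.size, size=n, replace=False))
    return flat[idx]

a = np.array([1.94, 5.68, 2.77, 7.39, 2.51])
new_a = downsample(a, 3, seed=0)
print(new_a)
```

Sorting the chosen indices is only there to keep the surviving values in their original order, as in the desired output in the question; drop the np.sort call if order does not matter.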