Randomly choose different sets in numpy?


I am trying to randomly select a set of integers in numpy and am encountering a strange error. If I define a numpy array with two sets of different sizes, np.random.choice chooses between them without issue:

Set1 = np.array([[1, 3, 5], [2, 4]])
In:  np.random.choice(Set1)
Out: [2, 4]

However, once the sets in the numpy array are the same size, I get a ValueError:

Set2 = np.array([[1, 3, 5], [2, 4, 6]])
In:   np.random.choice(Set2)
ValueError: a must be 1-dimensional    

Could be user error, but I've checked several times and the only difference is the size of the sets. I realize I can do something like:

Chosen = np.random.choice(N, k)
Selection = Set[Chosen]

where N is the number of sets and k is the number of samples, but I'm wondering whether there is a better way and, specifically, what I am doing wrong to raise a ValueError when the sets are the same size.

Printout of Set1 and Set2 for reference:

In: Set1
Out: array([list([1, 3, 5]), list([2, 4])], dtype=object)
In: type(Set1)
Out: numpy.ndarray

In: Set2
Out: 
array([[1, 3, 5],
       [2, 4, 6]])
In: type(Set2)
Out: numpy.ndarray

Solution

  • Your issue is caused by a misunderstanding of how numpy arrays work. The first example cannot really be turned into a numerical array because numpy does not support ragged arrays. You end up with a one-dimensional array of object references that point to two Python lists. The second example is a proper 2xN numerical array, and np.random.choice rejects it because it only accepts one-dimensional input. I can think of two types of solutions here.
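
The difference shows up in the shape and dtype of the two arrays. A minimal sketch, assuming a recent NumPy release, where a ragged list has to be given dtype=object explicitly (older releases built the object array implicitly, as in the question's printout):

```python
import numpy as np

# Ragged input: NumPy stores two Python lists in a 1-D object array.
# Recent NumPy releases require dtype=object to be spelled out here.
ragged = np.array([[1, 3, 5], [2, 4]], dtype=object)

# Rectangular input: NumPy builds a genuine 2-D numerical array.
regular = np.array([[1, 3, 5], [2, 4, 6]])

print(ragged.shape, ragged.dtype)  # (2,) object
print(regular.shape)               # (2, 3)
```

Only the first array is one-dimensional, which is why only the first one is accepted by np.random.choice.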

    The obvious approach (which would work in both cases, by the way) would be to choose the index instead of the sublist. Since you are sampling with replacement, you can just generate the index and use it directly:

    Set[np.random.randint(N, size=k)]
    

    This is the same as

    Set[np.random.choice(N, k)]
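
For concreteness, here is the index-based approach applied to the question's Set2. The values of N and k below are illustrative choices, not taken from the question:

```python
import numpy as np

Set2 = np.array([[1, 3, 5], [2, 4, 6]])
N = Set2.shape[0]  # number of sets (rows)
k = 4              # number of samples; may exceed N when sampling with replacement

# Draw k row indices in [0, N) and use them to index the first axis.
index = np.random.randint(N, size=k)
Selection = Set2[index]

print(Selection.shape)  # (4, 3)
```

The same indexing works on the ragged object array, since only the first axis is involved.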
    

    If you want to choose without replacement, your best bet is to use np.random.choice with replace=False. This is similar to, but less efficient than, shuffling. In either case, you can write a one-liner for the index:

    Set[np.random.choice(N, k, replace=False)]
    

    Or:

    index = np.arange(Set.shape[0])
    np.random.shuffle(index)
    Set[index[:k]]
    

    The nice thing about np.random.shuffle, though, is that you can apply it to Set directly, whether it is a one- or many-dimensional array. Shuffling will always happen along the first axis, so you can just take the top k elements afterwards:

    np.random.shuffle(Set)
    Set[:k]
    

    np.random.shuffle works only in place and returns None, so you have to write it out the long way, and applying it to Set directly reorders your original data. It's also less efficient for large arrays, since the entire array (or index range) has to be created and shuffled up front, no matter how small k is.
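
If the original order of Set has to be preserved, one option is to shuffle a copy, or to use np.random.permutation, which returns a shuffled copy instead of working in place. A small sketch with made-up data:

```python
import numpy as np

Set = np.array([[1, 3, 5], [2, 4, 6], [7, 8, 9]])  # illustrative data
k = 2

# Option 1: shuffle a copy so that Set itself is left untouched.
shuffled = Set.copy()
np.random.shuffle(shuffled)  # in place, returns None
sample_a = shuffled[:k]

# Option 2: permutation returns a new array shuffled along the first axis.
sample_b = np.random.permutation(Set)[:k]
```

Both options still shuffle all rows, so they share the cost noted above when k is much smaller than the number of sets.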

    The other solution is to turn the second example into an array of list objects like the first one. I do not recommend this solution unless the only reason you are using numpy is for the choice function. In fact, I wouldn't recommend it at all, since you can, and probably should, use Python's standard random module at this point. Disclaimers aside, you can coerce the datatype of the second array to be object. Doing so removes any benefit of using numpy, and it can't be done directly: simply setting dtype=object will still create a 2D array, but one that stores references to Python int objects instead of primitives. You have to do something like this:

    Set = np.zeros(N, dtype=object)
    Set[:] = [[1, 2, 3], [2, 4]]
    

    You will now get an object essentially equivalent to the one in the first example, and can therefore apply np.random.choice directly.
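
Note that assigning equal-length lists through a slice would be interpreted as 2-D data and fail to broadcast into the 1-D object array. One way around that, sketched here for the question's second example (the names rows and SetObj are made up for illustration), is to fill the array element by element:

```python
import numpy as np

rows = [[1, 3, 5], [2, 4, 6]]

# Pre-allocate a 1-D object array, then store each list as a single element.
SetObj = np.empty(len(rows), dtype=object)
for i, row in enumerate(rows):
    SetObj[i] = row

print(SetObj.shape, SetObj.dtype)  # (2,) object

# The array is now 1-D, so choice accepts it and returns one of the lists.
picked = np.random.choice(SetObj)
```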

    Note

    I show the legacy np.random methods here because of personal inertia if nothing else. The correct way, as suggested in the documentation I link to, is to use the new Generator API. This is especially true for the choice method, which is much more efficient in the new implementation. The usage is not any more difficult:

    Set[np.random.default_rng().choice(N, k, replace=False)]
    

    There are additional advantages, like the fact that you can now choose directly, even from a multidimensional array:

    np.random.default_rng().choice(Set2, k, replace=False)
    

    The same goes for shuffle, which, like choice, now allows you to select the axis you want to rearrange:

    np.random.default_rng().shuffle(Set)
    Set[:k]
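
A short sketch of the axis argument in the Generator API, using the question's Set2. The seed is an arbitrary choice for reproducibility:

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
Set2 = np.array([[1, 3, 5], [2, 4, 6]])

# Choose one whole row (axis=0 is the default).
row = rng.choice(Set2, 1, replace=False)

# Choose two whole columns without replacement.
cols = rng.choice(Set2, 2, replace=False, axis=1)

# Shuffle the columns of a copy in place; each column stays intact.
shuffled = Set2.copy()
rng.shuffle(shuffled, axis=1)
```

Creating the generator once and reusing it, as here, avoids constructing a new one on every call.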