python numpy boolean-indexing

Explanation of boolean indexing behaviors


For the 2D array y:

import numpy as np
y = np.arange(20).reshape(5, 4)
---
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]
 [16 17 18 19]]

Each of the following index expressions selects the 1st, 3rd, and 5th rows. This is clear.

print(y[
    [0, 2, 4],
    ::
])
print(y[
    [True, False, True, False, True],
    ::
])
---
[[ 0  1  2  3]
 [ 8  9 10 11]
 [16 17 18 19]]

Questions

Please help me understand what rules or mechanisms produce the following results.

Replacing the list with a tuple produces an empty array with shape (0, 5, 4).

y[
    (True, False, True, False, True)
]
---
array([], shape=(0, 5, 4), dtype=int64)

Using a single True adds a new axis.

y[True]
---
array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11],
        [12, 13, 14, 15],
        [16, 17, 18, 19]]])


y[True].shape
---
(1, 5, 4)

Adding a second boolean True produces the same result.

y[True, True]
---
array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11],
        [12, 13, 14, 15],
        [16, 17, 18, 19]]])

y[True, True].shape
---
(1, 5, 4)

However, adding a boolean False produces the empty array again.

y[True, False]
---
array([], shape=(0, 5, 4), dtype=int64)

I am not sure the documentation explains this behavior. The relevant passage says:

In general if an index includes a Boolean array, the result will be identical to inserting obj.nonzero() into the same position and using the integer array indexing mechanism described above. x[ind_1, boolean_array, ind_2] is equivalent to x[(ind_1,) + boolean_array.nonzero() + (ind_2,)].

If there is only one Boolean array and no integer indexing array present, this is straight forward. Care must only be taken to make sure that the boolean index has exactly as many dimensions as it is supposed to work with.
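
The equivalence quoted above can be checked on the row mask from the first example. This is only a minimal sketch: the names `mask`, `rows`, `by_mask`, and `by_nonzero` are illustrative and do not appear in the documentation, and the mask is converted to a NumPy array so that `nonzero()` is available.

```python
import numpy as np

y = np.arange(20).reshape(5, 4)
mask = np.array([True, False, True, False, True])

# nonzero() returns a tuple holding one integer index array per mask dimension.
rows = mask.nonzero()
print(rows)

# Index once with the boolean mask and once with the integer indices it maps to.
by_mask = y[mask, :]
by_nonzero = y[rows + (slice(None),)]

# Both forms should select the same rows.
print(np.array_equal(by_mask, by_nonzero))
```

Here the mask has one dimension and is applied to the first axis of y, which is the "exactly as many dimensions as it is supposed to work with" case the quote describes.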


Solution

  • Boolean scalar indexing is not well documented, but you can trace how it is handled in the source code. See, for example, this comment and the associated code in the NumPy source:

    /*
    * This can actually be well defined. A new axis is added,
    * but at the same time no axis is "used". So if we have True,
    * we add a new axis (a bit like with np.newaxis). If it is
    * False, we add a new axis, but this axis has 0 entries.
    */
    

    So if an index is a scalar boolean, a new axis is added. If the value is True the size of the axis is 1, and if the value is False, the size of the axis is zero.

    This behavior was introduced in numpy#3798, and the author outlines the motivation in this comment; roughly, the aim was to provide consistency in the output of filtering operations. For example:

    x = np.ones((2, 2))
    assert x[x > 0].ndim == 1
    
    x = np.ones(2)
    assert x[x > 0].ndim == 1
    
    x = np.ones(())
    assert x[x > 0].ndim == 1  # scalar boolean here!
    

    The interesting thing is that scalar booleans after the first do not add further dimensions! From an implementation standpoint, this seems to be because consecutive 0D boolean indices are treated as equivalent to consecutive fancy indices (i.e. HAS_0D_BOOL is treated as HAS_FANCY in some cases) and are therefore combined in the same way as fancy indices. From a logical standpoint, this corner-case behavior does not appear to be intentional: for example, I can't find any discussion of it in numpy#3798.

    Given that, I would recommend treating this behavior as poorly defined and avoiding it in favor of well-documented indexing approaches.
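
As a rough illustration of the points above, the sketch below first reproduces the scalar-boolean shapes from the question and then shows documented ways to express what was probably intended. The names `expanded`, `row_mask`, and `selected` are illustrative, and the choice of alternatives is an assumption about intent, not something prescribed by NumPy.

```python
import numpy as np

y = np.arange(20).reshape(5, 4)

# A scalar boolean adds a leading axis: length 1 for True, length 0 for False.
print(y[True].shape)
print(y[False].shape)

# A tuple index is unpacked into separate indices, so a tuple of Python
# booleans acts as five scalar booleans rather than one row mask.
print(y[(True, False, True, False, True)].shape)

# Documented alternative for adding an axis: np.newaxis.
expanded = y[np.newaxis, ...]
print(expanded.shape)

# Documented alternative for selecting rows: a boolean array (or list) mask.
row_mask = np.array([True, False, True, False, True])
selected = y[row_mask]
print(selected)
```

Wrapping the booleans in a list or an array makes NumPy treat them as a single one-dimensional mask over the rows, which is why the list form in the question selects three rows while the tuple form returns an empty array.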