python numpy machine-learning one-hot-encoding

One-Hot Encode numpy array with >2 dims

I have a numpy array of shape (192, 224, 192, 1). The last dimension holds the integer class that I would like to one-hot encode. For example, if I have 12 classes, I would like the shape of the resulting array to be (192, 224, 192, 12), with the last dimension being all zeros except for a 1 at the index corresponding to the original value.

I can do this naively with many for loops, but would like to know if there is a better way.
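For reference, the naive approach could look roughly like the sketch below. The names onehot_loops and num_classes are made up for illustration, and the tiny array stands in for the real (192, 224, 192, 1) data.

```python
import numpy as np

def onehot_loops(a, num_classes):
    """Loop-based baseline: visit every element and set a single entry."""
    labels = a[..., 0]                      # drop the trailing length-1 axis
    out = np.zeros(labels.shape + (num_classes,), dtype=bool)
    for idx in np.ndindex(labels.shape):    # every index tuple of the leading axes
        out[idx + (labels[idx],)] = True
    return out

small = np.array([[[2], [0]], [[1], [2]]])  # shape (2, 2, 1), hypothetical data
ref = onehot_loops(small, num_classes=3)
print(ref.shape)  # (2, 2, 3)
```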

Solution

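You can do this in a single indexing operation if you know the number of classes. Given an array a of non-negative integer labels and m = a.max() + 1 (or the known class count, e.g. m = 12, if the largest class might not appear in a):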

out = np.zeros(a.shape[:-1] + (m,), dtype=bool)
out[(*np.indices(a.shape[:-1], sparse=True), a[..., 0])] = True

It's easier if you remove the unnecessary trailing dimension:

a = np.squeeze(a)
out = np.zeros(a.shape + (m,), bool)
out[(*np.indices(a.shape, sparse=True), a)] = True

The index has to be written as an explicit tuple for the star expansion to work: before Python 3.11, a bare starred expression is not allowed inside a subscript.
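To see why the indexing works, here is a small sketch on made-up data (the array values are an assumption, chosen only to keep the output readable). The sparse index grids broadcast against the label array, so each element of a selects exactly one position along the new axis.

```python
import numpy as np

a = np.array([[[2], [0]], [[1], [2]]])   # shape (2, 2, 1), labels in 0..2
m = a.max() + 1

# Sparse index grids for the two leading axes
grids = np.indices(a.shape[:-1], sparse=True)
print([g.shape for g in grids])          # [(2, 1), (1, 2)]

out = np.zeros(a.shape[:-1] + (m,), dtype=bool)
# grids broadcast against a[..., 0] (shape (2, 2)), one True per element
out[(*grids, a[..., 0])] = True

print(out.shape)                         # (2, 2, 3)
# Taking argmax along the class axis recovers the original labels
recovered = out.argmax(axis=-1)
print(np.array_equal(recovered, a[..., 0]))  # True
```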

If you want to put the new dimension at an arbitrary position, you can do that too. The following inserts a new dimension into the squeezed array at axis. Here axis is the position of the new axis in the final array, so axis=-1 appends it at the end. That is consistent with, say, np.stack, but not with list.insert, where an index of -1 inserts before the last element:

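def onehot(a, axis=-1, dtype=bool):
    # Normalize axis to a position in the output, which has a.ndim + 1 axes
    pos = axis if axis >= 0 else a.ndim + axis + 1
    shape = list(a.shape)
    shape.insert(pos, a.max() + 1)  # new class axis of length max + 1
    out = np.zeros(shape, dtype)
    # Sparse index grids for the existing axes, with the labels at pos
    ind = list(np.indices(a.shape, sparse=True))
    ind.insert(pos, a)
    out[tuple(ind)] = True
    return out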
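One caveat: a.max() + 1 undercounts the classes when the largest label does not occur in a. A possible variation, sketched below, takes the class count as an argument. The name onehot_n and the num_classes parameter are hypothetical additions for illustration.

```python
import numpy as np

def onehot_n(a, num_classes=None, axis=-1, dtype=bool):
    # num_classes is a hypothetical extra parameter; fall back to max + 1
    if num_classes is None:
        num_classes = a.max() + 1
    pos = axis if axis >= 0 else a.ndim + axis + 1
    shape = list(a.shape)
    shape.insert(pos, num_classes)
    out = np.zeros(shape, dtype)
    ind = list(np.indices(a.shape, sparse=True))
    ind.insert(pos, a)
    out[tuple(ind)] = True
    return out

labels = np.array([[0, 2], [1, 0]])              # only classes 0..2 are present
print(onehot_n(labels).shape)                    # (2, 2, 3)
print(onehot_n(labels, num_classes=12).shape)    # (2, 2, 12)
```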

If the array already has a singleton dimension that you want to expand in place, a variant of the function can default to the first available singleton dimension:

def onehot2(a, axis=None, dtype=bool):
    shape = np.array(a.shape)
    if axis is None:
        # Default to the first singleton axis (argmax gives 0 if there is none)
        axis = (shape == 1).argmax()
    if shape[axis] != 1:
        raise ValueError(f'Dimension at {axis} is non-singleton')
    shape[axis] = a.max() + 1  # expand the singleton to hold the classes
    out = np.zeros(shape, dtype)
    ind = list(np.indices(a.shape, sparse=True))
    ind[axis] = a  # labels replace the index grid for that axis
    out[tuple(ind)] = True
    return out

To use the last available singleton, replace axis = (shape == 1).argmax() with

axis = a.ndim - 1 - (shape[::-1] == 1).argmax()
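As a quick check of the two expressions, here is what they give for a shape with two singleton axes (the shape is just an example). When there is no singleton at all, argmax returns 0, which is why the shape[axis] != 1 check in onehot2 is still needed.

```python
import numpy as np

shape = np.array((3, 1, 2, 1))

# First singleton: argmax returns the index of the first True
first = (shape == 1).argmax()
# Last singleton: search the reversed shape, then map the index back
last = len(shape) - 1 - (shape[::-1] == 1).argmax()
print(first, last)        # 1 3

# No singleton present: argmax falls back to 0, which is not a singleton axis
no_single = np.array((3, 2))
fallback = (no_single == 1).argmax()
print(fallback, no_single[fallback] == 1)   # 0 False
```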

Here are some example usages:

>>> np.random.seed(0x111)
>>> x = np.random.randint(5, size=(3, 2))
>>> x
array([[2, 3],
       [3, 1],
       [4, 0]])

>>> a = onehot(x, axis=-1, dtype=int)
>>> a.shape
(3, 2, 5)
>>> a
array([[[0, 0, 1, 0, 0],    # 2
        [0, 0, 0, 1, 0]],   # 3

       [[0, 0, 0, 1, 0],    # 3
        [0, 1, 0, 0, 0]],   # 1

       [[0, 0, 0, 0, 1],    # 4
        [1, 0, 0, 0, 0]]]   # 0

>>> b = onehot(x, axis=-2, dtype=int)
>>> b.shape
(3, 5, 2)
>>> b
array([[[0, 0],
        [0, 0],
        [1, 0],
        [0, 1],
        [0, 0]],

       [[0, 0],
        [0, 1],
        [0, 0],
        [1, 0],
        [0, 0]],

       [[0, 1],
        [0, 0],
        [0, 0],
        [0, 0],
        [1, 0]]])
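A quick way to convince yourself that the two results agree is sketched below, with onehot repeated so the snippet runs on its own and x typed in by hand rather than generated from the seed. Moving the class axis of b to the end should reproduce a, and argmax along the class axis should recover x.

```python
import numpy as np

def onehot(a, axis=-1, dtype=bool):
    pos = axis if axis >= 0 else a.ndim + axis + 1
    shape = list(a.shape)
    shape.insert(pos, a.max() + 1)
    out = np.zeros(shape, dtype)
    ind = list(np.indices(a.shape, sparse=True))
    ind.insert(pos, a)
    out[tuple(ind)] = True
    return out

x = np.array([[2, 3], [3, 1], [4, 0]])   # same values as the example above

a = onehot(x, axis=-1, dtype=int)        # shape (3, 2, 5)
b = onehot(x, axis=-2, dtype=int)        # shape (3, 5, 2)

# Same encoding, only the position of the class axis differs
same = np.array_equal(np.moveaxis(b, -2, -1), a)
# argmax along the class axis undoes the encoding
roundtrip = np.array_equal(a.argmax(axis=-1), x) and np.array_equal(b.argmax(axis=-2), x)
print(same, roundtrip)                   # True True
```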

onehot2 requires the dimension you want to expand to already be present as a singleton:

>>> np.random.seed(0x111)
>>> y = np.random.randint(5, size=(3, 1, 2, 1))
>>> y
array([[[[2],
         [3]]],
       [[[3],
         [1]]],
       [[[4],
         [0]]]])

>>> c = onehot2(y, axis=-1, dtype=int)
>>> c.shape
(3, 1, 2, 5)
>>> c
array([[[[0, 0, 1, 0, 0],
         [0, 0, 0, 1, 0]]],

       [[[0, 0, 0, 1, 0],
         [0, 1, 0, 0, 0]]],

       [[[0, 0, 0, 0, 1],
         [1, 0, 0, 0, 0]]]])

>>> d = onehot2(y, axis=-2, dtype=int)
ValueError: Dimension at -2 is non-singleton

>>> e = onehot2(y, dtype=int)
>>> e.shape
(3, 5, 2, 1)
>>> e.squeeze()
array([[[0, 0],
        [0, 0],
        [1, 0],
        [0, 1],
        [0, 0]],

       [[0, 0],
        [0, 1],
        [0, 0],
        [1, 0],
        [0, 0]],

       [[0, 1],
        [0, 0],
        [0, 0],
        [0, 0],
        [1, 0]]])