Tags: python, numpy, machine-learning, scikit-learn, one-hot-encoding

One-Hot Encoding without for-loop from vector of positions in Python with NumPy?


I have some data that I want to "one-hot encode" and it is represented as a 1-dimensional vector of positions.

Is there any function in NumPy that can expand my x into my x_ohe?

I'm trying to avoid Python for-loops at all costs for operations like this after watching Jake VanderPlas's talk.

import numpy as np

x = np.asarray([0,0,1,0,2])
x_ohe = np.zeros((len(x), 3), dtype=int)
for i, pos in enumerate(x):
    x_ohe[i,pos] = 1
x_ohe
# array([[1, 0, 0],
#        [1, 0, 0],
#        [0, 1, 0],
#        [1, 0, 0],
#        [0, 0, 1]])

Solution

  • If x contains only non-negative integers, you can compare x against the sequence np.arange(x.max()+1) using NumPy broadcasting, then convert the boolean result to ints:

    (x[:,None] == np.arange(x.max()+1)).astype(int)
    
    #array([[1, 0, 0],
    #       [1, 0, 0],
    #       [0, 1, 0],
    #       [1, 0, 0],
    #       [0, 0, 1]])
    

    Or initialize an array of zeros first, then assign the ones using advanced indexing:

    x_ohe = np.zeros((len(x), 3), dtype=int)
    x_ohe[np.arange(len(x)), x] = 1
    x_ohe
    
    #array([[1, 0, 0],
    #       [1, 0, 0],
    #       [0, 1, 0],
    #       [1, 0, 0],
    #       [0, 0, 1]])
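
As a rough sketch, the advanced-indexing approach above could be wrapped in a small helper so the number of columns is not hard-coded. The function name `one_hot` and its `num_classes` parameter are hypothetical, made up for this illustration. The sketch assumes x is a non-empty 1-D array of non-negative integers. The last line shows a third loop-free variant, indexing the rows of an identity matrix with x, which is not part of the answer above.

```python
import numpy as np

def one_hot(positions, num_classes=None):
    """Hypothetical helper: one-hot encode a 1-D vector of non-negative ints."""
    positions = np.asarray(positions)
    if num_classes is None:
        # Assumption: the largest position present determines the width.
        num_classes = int(positions.max()) + 1
    encoded = np.zeros((positions.size, num_classes), dtype=int)
    # Advanced indexing: row i, column positions[i] is set to 1.
    encoded[np.arange(positions.size), positions] = 1
    return encoded

x = np.asarray([0, 0, 1, 0, 2])
x_ohe = one_hot(x)                   # width inferred from x.max() + 1
x_wide = one_hot(x, num_classes=4)   # explicit width, extra all-zero column
x_eye = np.eye(3, dtype=int)[x]      # alternative: pick rows of an identity matrix
```

Passing `num_classes` explicitly may be the safer choice when a batch might not contain the largest category, since inferring the width from `x.max()` would then produce too few columns.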