python numpy tflearn one-hot-encoding

tflearn to_categorical: Processing data from pandas.df.values: array of arrays


import numpy as np
from tflearn.data_utils import to_categorical

labels = np.array([['positive'],['negative'],['negative'],['positive']])
# output from pandas is similar to the above
values = (labels=='positive').astype(np.int_)
to_categorical(values,2)

Output:

array([[ 1.,  1.],
       [ 1.,  1.],
       [ 1.,  1.],
       [ 1.,  1.]])

If I remove the inner list around each element, it works as expected:

labels = np.array([['positive'],['negative'],['negative'],['positive']])
values = (labels=='positive').astype(np.int_)
to_categorical(values.T[0],2)

Output:

array([[ 0.,  1.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 0.,  1.]])

Why does it behave this way? I'm following some tutorials, and they seem to get the right output even for an array of arrays. Was the function changed recently to behave this way?

I'm using tflearn 0.3.2 on Python 3.6.2.


Solution

  • Take a look at the source code for to_categorical:

    def to_categorical(y, nb_classes):
        """ to_categorical.
    
        Convert class vector (integers from 0 to nb_classes)
        to binary class matrix, for use with categorical_crossentropy.
    
        Arguments:
            y: `array`. Class vector to convert.
            nb_classes: `int`. Total number of classes.
    
        """
        y = np.asarray(y, dtype='int32')
        if not nb_classes:
            nb_classes = np.max(y)+1
        Y = np.zeros((len(y), nb_classes))
        Y[np.arange(len(y)),y] = 1.
        return Y
    

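    The core part is the advanced indexing Y[np.arange(len(y)),y] = 1., which uses each entry of y as the column index for the corresponding row of the result, so y needs to be a 1d array to work properly. An (n, 1) array like yours broadcasts against np.arange(len(y)) to shape (n, n) without error, so every row gets a 1 in every class that occurs in y, which is the all-ones output you saw. For most other 2d shapes you get a broadcasting error instead.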

    For instance:

    to_categorical([[1,2,3],[2,3,4]], 2)
    

    ---------------------------------------------------------------------------
    IndexError                                Traceback (most recent call last)
    in ()
    ----> 1 to_categorical([[1,2,3],[2,3,4]], 2)

    c:\anaconda3\envs\tensorflow\lib\site-packages\tflearn\data_utils.py in to_categorical(y, nb_classes)
         40         nb_classes = np.max(y)+1
         41     Y = np.zeros((len(y), nb_classes))
    ---> 42     Y[np.arange(len(y)),y] = 1.
         43     return Y
         44

    IndexError: shape mismatch: indexing arrays could not be broadcast together with shapes (2,) (2,3)
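To see the broadcasting concretely, here is a minimal sketch that repeats the indexing step from to_categorical with plain NumPy, so it runs without tflearn. The variable names (rows, Y_flat, and so on) are my own and not part of the library. It compares a column-shaped input like the one in the question with its flattened version:

```python
import numpy as np

# Class indices as a column, shape (4, 1), like the values taken from pandas
y = np.array([[1], [0], [0], [1]], dtype='int32')
rows = np.arange(len(y))  # shape (4,)

# The two index arrays broadcast to a common shape instead of pairing up 1:1
print(np.broadcast(rows, y).shape)  # (4, 4)

Y = np.zeros((len(y), 2))
Y[rows, y] = 1.  # every row index is combined with every class index
print(Y)         # all ones, as in the question

# With a 1d vector, row i is paired only with class y_flat[i]
y_flat = y.ravel()
Y_flat = np.zeros((len(y_flat), 2))
Y_flat[np.arange(len(y_flat)), y_flat] = 1.
print(Y_flat)
```

The second result is the one-hot matrix you expected, which is why any approach that turns the column into a 1d vector fixes the problem.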

    Any of these methods works fine:

    to_categorical(values.ravel(), 2)
    array([[ 0.,  1.],
           [ 1.,  0.],
           [ 1.,  0.],
           [ 0.,  1.]])
    
    to_categorical(values.squeeze(), 2)
    array([[ 0.,  1.],
           [ 1.,  0.],
           [ 1.,  0.],
           [ 0.,  1.]])
    
    to_categorical(values[:,0], 2)
    array([[ 0.,  1.],
           [ 1.,  0.],
           [ 1.,  0.],
           [ 0.,  1.]])
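The three calls agree for an (n, 1) column, but they differ on other shapes, which may matter if the data ever has a single row or more than one column. The sketch below shows the difference and adds a small guard; as_class_vector is a hypothetical helper of my own, not something tflearn provides:

```python
import numpy as np

single = np.array([[1]])           # one row, one column
wide = np.array([[1, 0], [0, 1]])  # two columns

# Compare the resulting shapes of the three approaches
for name, arr in (("single", single), ("wide", wide)):
    print(name, arr.ravel().shape, arr.squeeze().shape, arr[:, 0].shape)
# single (1,) () (1,)
# wide (4,) (2, 2) (2,)


def as_class_vector(values):
    """Hypothetical helper: return a 1d int32 class vector or raise."""
    arr = np.asarray(values)
    if arr.ndim == 2 and arr.shape[1] == 1:
        arr = arr[:, 0]  # drop the singleton column only
    if arr.ndim != 1:
        raise ValueError("expected shape (n,) or (n, 1), got %s" % (arr.shape,))
    return arr.astype('int32')


values = np.array([[1], [0], [0], [1]])
print(as_class_vector(values))  # [1 0 0 1]
```

Note that squeeze() turns a (1, 1) array into a 0-d array, on which len() fails, and ravel() silently flattens a multi-column array. Checking the shape first, as in the helper, makes an unexpected input fail loudly instead of producing a wrong encoding.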