
tflearn label encoding with large number of classes


I am trying to adapt tflearn's Convolutional Neural Network example to a classification task with ~12000 distinct class labels and more than 1 million training examples. The number of labels is apparently a problem in terms of memory consumption when one-hot encoding them: I first map my string labels to consecutive integers and then pass these as a list to the to_categorical() function. The following code leads to a MemoryError:

    from tflearn.data_utils import to_categorical

    trainY = to_categorical(trainY, nb_classes=n_classes)
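
For context, the mapping step described above is essentially the following (a minimal sketch; raw_labels and label_to_idx are illustrative names, not exact code):

    classes = sorted(set(raw_labels))                      # ~12000 distinct strings
    label_to_idx = {c: i for i, c in enumerate(classes)}   # string -> integer id
    trainY = [label_to_idx[label] for label in raw_labels]
    n_classes = len(classes)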

Do I have to encode the labels like this, or should I use a different loss function than cross-entropy? Can I train in batches with tflearn, for example by passing a generator to the DNN.fit() function?

Thanks for any advice!


Solution

  • In the regression layer, you can specify that the labels fed in should be one-hot encoded on the fly:

        tflearn.layers.regression(incoming_net,
                                  loss='categorical_crossentropy',
                                  batch_size=64,
                                  to_one_hot=True,    # encode labels per batch
                                  n_classes=12000)
    

    This way you should not run into a MemoryError: the labels stay a flat array of integer class ids and are one-hot encoded batch by batch during training.
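
    Putting it together, a minimal end-to-end sketch might look like the following. The architecture here is a placeholder (input shape, layer sizes, and optimizer are illustrative assumptions, not from the question); the relevant parts are to_one_hot=True, n_classes, and the fact that trainY remains a plain list of integer class ids:

        import tflearn
        from tflearn.layers.core import input_data, fully_connected
        from tflearn.layers.conv import conv_2d, max_pool_2d
        from tflearn.layers.estimator import regression

        # Placeholder conv net; substitute your own architecture.
        net = input_data(shape=[None, 32, 32, 3])
        net = conv_2d(net, 64, 3, activation='relu')
        net = max_pool_2d(net, 2)
        net = fully_connected(net, 12000, activation='softmax')
        net = regression(net,
                         optimizer='adam',
                         loss='categorical_crossentropy',
                         batch_size=64,
                         to_one_hot=True,   # labels one-hot encoded per batch
                         n_classes=12000)

        model = tflearn.DNN(net)
        # trainY is a 1-D list/array of integer class ids; the full
        # one-hot matrix is never materialized in memory.
        model.fit(trainX, trainY, n_epoch=10)

    Note that DNN.fit() already iterates over the data in batches (using the batch_size set in the regression layer, or one passed to fit()), so no generator is needed for this.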