Tags: tensorflow, machine-learning, keras, nlp

How to train an NLP text model where text files are stored in category-named folders?


I've mainly worked on image classification problems so far, and the flow_from_directory method of ImageDataGenerator has always made it simple to work with data stored in category folders. I'm trying to train a model that uses both images and text, but I first need to figure out how to read and then preprocess text data stored in the same way. I haven't found many answers after searching. Any ideas?

I know that creating my own generator could help here, but I couldn't make one that satisfies my needs.


Solution

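  • You can use tf.keras.utils.text_dataset_from_directory. It works similarly to image_dataset_from_directory: each subfolder of the given directory is treated as one class, and labels are inferred from the folder names.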

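    import tensorflow as tf

    batch_size = 1024
    seed = 123

    # Each subfolder of 'aclImdb/train' is treated as one class.
    # Use the same seed and validation_split in both calls so the two subsets do not overlap.
    train_ds = tf.keras.utils.text_dataset_from_directory(
        'aclImdb/train', batch_size=batch_size, validation_split=0.2,
        subset='training', seed=seed)
    val_ds = tf.keras.utils.text_dataset_from_directory(
        'aclImdb/train', batch_size=batch_size, validation_split=0.2,
        subset='validation', seed=seed)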
    

    Reference: https://www.tensorflow.org/api_docs/python/tf/keras/utils/text_dataset_from_directory

    Working Example: https://www.tensorflow.org/text/guide/word_embeddings
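
Since the question mentions trying to write a custom generator, here is a minimal sketch of one in plain Python, in case the built-in utility does not fit. It assumes a layout of one subfolder per category containing UTF-8 .txt files. The names iter_text_files and build_demo_dir and the neg/pos sample data are hypothetical, made up for illustration. Labels are assigned by sorting the folder names, which mirrors the alphanumeric ordering that text_dataset_from_directory uses for inferred labels.

```python
import os
import tempfile


def iter_text_files(root):
    """Yield (text, label_index) for every .txt file under root/<class_name>/."""
    # Sort class folders so label indices are stable across runs.
    class_names = sorted(
        d for d in os.listdir(root) if os.path.isdir(os.path.join(root, d))
    )
    for label, name in enumerate(class_names):
        class_dir = os.path.join(root, name)
        for fname in sorted(os.listdir(class_dir)):
            if fname.endswith(".txt"):
                with open(os.path.join(class_dir, fname), encoding="utf-8") as f:
                    yield f.read(), label


def build_demo_dir():
    """Create a tiny hypothetical dataset with 'neg' and 'pos' category folders."""
    root = tempfile.mkdtemp()
    samples = {"neg": ["dull and slow"], "pos": ["bright and funny", "a joy"]}
    for name, texts in samples.items():
        os.makedirs(os.path.join(root, name))
        for i, text in enumerate(texts):
            path = os.path.join(root, name, f"{i}.txt")
            with open(path, "w", encoding="utf-8") as f:
                f.write(text)
    return root


pairs = list(iter_text_files(build_demo_dir()))
print(pairs)
```

A generator like this could probably be wrapped with tf.data.Dataset.from_generator (passing an output_signature) to feed a Keras model, but for a plain folder-per-class layout text_dataset_from_directory is likely the simpler choice.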