How to train an NLP text model when the text files are stored in category-named folders?

I've mainly worked on image classification problems so far, and the flow_from_directory method of ImageDataGenerator has always made it simple to work with data stored in one folder per category. I'm trying to train a model that uses both image and text, but I first need to figure out how to read and then preprocess text data stored in the same way. I haven't found many answers on this after searching. Any ideas?

I know that creating my own generator could help in this case, but I couldn't make one that satisfies my needs.

Solution

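You can use tf.keras.utils.text_dataset_from_directory. It works similarly to image_dataset_from_directory: each subfolder of the root directory is treated as one class, and the function returns a tf.data.Dataset that yields batches of (text, label) pairs. Preprocessing, for example tokenization with a TextVectorization layer, can then be applied to that dataset.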

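import tensorflow as tf

batch_size = 1024
seed = 123  # use the same seed in both calls so the two subsets do not overlap

# Each subfolder of 'aclImdb/train' is one class; labels are inferred from the folder names.
train_ds = tf.keras.utils.text_dataset_from_directory(
    'aclImdb/train', batch_size=batch_size, validation_split=0.2,
    subset='training', seed=seed)
val_ds = tf.keras.utils.text_dataset_from_directory(
    'aclImdb/train', batch_size=batch_size, validation_split=0.2,
    subset='validation', seed=seed)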
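
If you still want a generator of your own (for example, to pair each text with an image later), the folder convention can be sketched in plain Python. This is only a rough illustration of the idea, not how TensorFlow implements it: the helper names list_classes and iter_text_files and the demo files are made up, and it assumes one subfolder per class containing UTF-8 .txt files, with labels assigned by sorted folder name.

```python
import os
import tempfile


def list_classes(root):
    """Return subfolder names in sorted order; the list index is the label."""
    return sorted(
        name for name in os.listdir(root)
        if os.path.isdir(os.path.join(root, name))
    )


def iter_text_files(root):
    """Yield (text, label) pairs, one per .txt file under root/<class>/."""
    for label, class_name in enumerate(list_classes(root)):
        class_dir = os.path.join(root, class_name)
        for file_name in sorted(os.listdir(class_dir)):
            if file_name.endswith(".txt"):
                path = os.path.join(class_dir, file_name)
                with open(path, encoding="utf-8") as handle:
                    yield handle.read(), label


# Tiny demo on a throwaway directory (hypothetical class names and contents).
with tempfile.TemporaryDirectory() as root:
    for class_name, text in [("neg", "dull plot"), ("pos", "great film")]:
        os.makedirs(os.path.join(root, class_name))
        file_path = os.path.join(root, class_name, "0.txt")
        with open(file_path, "w", encoding="utf-8") as f:
            f.write(text)
    samples = list(iter_text_files(root))

print(samples)
```

A generator like this could be wrapped with tf.data.Dataset.from_generator if you need it in a tf.data pipeline, but for text-only input text_dataset_from_directory is the simpler route.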

Reference: https://www.tensorflow.org/api_docs/python/tf/keras/utils/text_dataset_from_directory
Working example: https://www.tensorflow.org/text/guide/word_embeddings