
How to set up TF 2.4 training data with a generator or other means


I have a model set up with one input and two outputs. I am trying to use any of:

  1. tf.data.Dataset.from_generator
  2. fit with a regular Python generator
  3. tf.data.TFRecordDataset

So far all my attempts have run into errors, which I can only assume come from the shapes/types of the output of the generators I've tried setting up. What format should the output of such a generator be? I am also super open to suggestions for doing this differently. You can download my whole notebook here if you'd like to look through it.

The input

The input to the model is of shape

(None,)

and is of type

tf.string

I am able to get model output with

model(tf.constant(['Hello TensorFlow!']))

The outputs

There are two output heads for the model, the first is of shape

(None, 128, 5)

The second is of shape

(None, 128, 3)

They both are of type

tf.float32

The loss for my model is sparse categorical crossentropy. (I want a softmax across 5 or 3 classes, depending on the head, for each of the 128 outputs, with the None there for the batch size.) I believed the proper generator output for this would be a tuple of batch_size instances of the following format

(input_string, (output_for_head1, output_for_head2))

where input_string is a string, and output_for_head1 and output_for_head2 are both numpy arrays of shape (128,) and integer dtype.
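
For concreteness, a two-head model with these shapes and this loss could be wired up roughly as below. This is a minimal sketch, assuming a TextVectorization + Embedding front end; the actual architecture is in the notebook, and every layer choice here is a placeholder.

    import tensorflow as tf

    inp = tf.keras.Input(shape=(), dtype=tf.string)  # input shape (None,), dtype tf.string

    # Placeholder preprocessing; in practice you would .adapt() this layer on
    # your corpus first. (This is the TF 2.4 path for TextVectorization.)
    x = tf.keras.layers.experimental.preprocessing.TextVectorization(
        output_sequence_length=128)(inp)
    x = tf.keras.layers.Embedding(input_dim=20000, output_dim=64)(x)

    head1 = tf.keras.layers.Dense(5, activation='softmax', name='classes')(x)     # (None, 128, 5)
    head2 = tf.keras.layers.Dense(3, activation='softmax', name='continuity')(x)  # (None, 128, 3)

    model = tf.keras.Model(inp, [head1, head2])
    # Sparse targets: integer class ids of shape (batch, 128) for each head.
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')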

Some random things I've tried for fitting on a generator directly

Yield a single item rather than a whole batch (using batch size 10 for all testing)

Gets an index-out-of-bounds error; I'm pretty sure this needs to be batched.

Yield the whole batch

Gets this error:

    Data is expected to be in format `x`, `(x,)`, `(x, y)`, or `(x, y, sample_weight)`, found: ((<tf.Tensor: shape=(), dtype=string, numpy=b'Ya Yeet'>, (<tf.Tensor: shape=(128,), dtype=int64, numpy=... ( a very long set of (128,) tensors which is too large to post here)


     [[{{node PyFunc}}]]
     [[IteratorGetNext]] [Op:__inference_train_function_95064]

Function call stack:
train_function
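
Judging from that error, the generator was yielding the batch as a tuple of per-example (input, (y1, y2)) pairs, which Keras cannot unpack as `(x, y)`. A rough reconstruction of that failing structure (not the exact notebook code; `examples` is a hypothetical list of (text, y1, y2) triples):

    # Failing structure: a tuple of batch_size (input, (y1, y2)) pairs,
    # instead of three batched arrays.
    def bad_batch_generator(examples, batch_size=10):
        for i in range(0, len(examples), batch_size):
            yield tuple((text, (y1, y2))
                        for text, y1, y2 in examples[i:i + batch_size])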


Solution

  • I figured out the solution to this using generators. I was able to first create a generator yielding numpy arrays that the model could be trained on directly, and then create a tf.data dataset from a slightly modified version of that generator.

    The solution was to yield just three numpy arrays per batch, in the form input_arr, (output_arr1, output_arr2). The shape of each array was expanded to have the batch size on the left, rather than yielding a tuple of length batch_size.

    The final generators looked like this

    import numpy as np
    import pandas as pd

    # `pad`, CLASSES and CONTINUITY_CLASSES are defined earlier in the notebook.
    def text_data_generator(dataset_path, batch_size, input_text_col='text',
                            output_classes_col='classes', continuity_col='continuity',
                            classes=CLASSES, continuity_classes=CONTINUITY_CLASSES,
                            pad_length=128, sep=' '):
        while True:  # loop forever so the model can train for multiple epochs
            for chunk in pd.read_csv(dataset_path, chunksize=batch_size):
                # TODO: should probably shuffle the dataset somehow
                texts = np.array(chunk[input_text_col].values)
                # Map space-separated label strings to padded integer ids,
                # giving arrays of shape (batch_size, 128).
                c_classes = np.stack(chunk[output_classes_col].apply(
                    lambda x: pad([classes.index(item) for item in x.split(sep)])).values)
                c_continuity = np.stack(chunk[continuity_col].apply(
                    lambda x: pad([continuity_classes.index(item) for item in x.split(sep)])).values)
                yield texts, (c_classes, c_continuity)
    

    and

    def tf_text_data_generator(dataset_path, batch_size, input_text_col='text',
                               output_classes_col='classes', continuity_col='continuity',
                               classes=CLASSES, continuity_classes=CONTINUITY_CLASSES,
                               pad_length=128, sep=' '):
        # Identical to text_data_generator, minus the infinite `while True` loop:
        # tf.data re-invokes the generator itself at the end of each epoch.
        for chunk in pd.read_csv(dataset_path, chunksize=batch_size):
            texts = np.array(chunk[input_text_col].values)
            c_classes = np.stack(chunk[output_classes_col].apply(
                lambda x: pad([classes.index(item) for item in x.split(sep)])).values)
            c_continuity = np.stack(chunk[continuity_col].apply(
                lambda x: pad([continuity_classes.index(item) for item in x.split(sep)])).values)
            yield texts, (c_classes, c_continuity)
    

    The model could be trained directly on an instance of text_data_generator. To train on the other generator, I created a tf.data.Dataset with:

    import tensorflow as tf

    def wrapped_gen():  # from_generator needs a zero-argument callable
        return tf_text_data_generator("test.csv", 10)
    # output_types mirrors the yielded structure: x, (y1, y2)
    dataset = tf.data.Dataset.from_generator(wrapped_gen, (tf.string, (tf.int64, tf.int64)))
    

    which can then be passed directly to model.fit, just as the instantiated generator could be.
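
    For completeness, training then looks the same either way. A sketch (the epochs and steps_per_epoch values are illustrative, not from the notebook; the plain Python generator loops forever, so it needs steps_per_epoch):

        # Fit on the tf.data pipeline (one pass over the CSV per epoch).
        model.fit(dataset, epochs=5)
        # Or fit directly on the infinite plain-Python generator.
        model.fit(text_data_generator("test.csv", 10), steps_per_epoch=100, epochs=5)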