
Tensorflow map function split the dataset structure


I am having trouble with the structure of the dataset in the TensorFlow map function. This is how my data looks:

simple

train_examples = tf.data.Dataset.from_tensor_slices(train_data)
[[0, 1, 2, 3, 4, 5, ...],
 [32, 33, 34, 35, 36, ...]]

real

print(train_data[0])
[[array([  2, 539, 400, 513, 398, 523, 485, 533, 568, 566, 402, 565, 491,
   570, 576, 539, 351, 538, 297, 539, 262, 564, 313, 581, 370, 589,
   421, 514, 314, 501, 370, 489, 420,   3]), array([  2, 534, 403, 507, 401, 519, 487, 531, 567, 562, 405, 544, 495,
   537, 588, 528, 354, 526, 300, 534, 259, 555, 315, 575, 370, 589,
   421, 499, 315, 489, 372, 483, 423,   3])]]

I convert it to a tensor dataset for the pipeline: <TensorSliceDataset shapes: (2, 34), types: tf.int64>

The train_examples dataset contains 2D tensors of the form [[source], [target]], with 17k rows.

def make_batches(ds):
    return (
        ds
        .cache()
        .shuffle(BUFFER_SIZE)
        .batch(BATCH_SIZE)
        .map(lambda x_int,y_int: x_int,y_int, num_parallel_calls=tf.data.experimental.AUTOTUNE)
        .prefetch(tf.data.experimental.AUTOTUNE))

train_batches = make_batches(train_examples)

For the map, I want the output data structure to have source and target separated. I tried map(prepare, num_parallel_calls=tf.data.experimental.AUTOTUNE) with:

def prepare(ds):
  srcs = tf.ragged.constant(ds.numpy()[0], tf.int64)
  trgs = tf.ragged.constant(ds.numpy()[1], tf.int64)

  srcs = srcs.to_tensor()
  trgs = trgs.to_tensor()
  return srcs,trgs

But TensorFlow doesn't allow eager execution inside the map function. If there is anything else I missed about the usage of the map function in TensorFlow, please let me know. Thank you.
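As background on why the `.numpy()` call fails: functions passed to `Dataset.map` are traced into a graph, so tensors inside them have no `.numpy()` method. If eager Python code is really needed inside `map`, `tf.py_function` can wrap it. A minimal sketch with made-up example data (note that pure graph ops, as in the solution below, are generally faster):

```python
import tensorflow as tf

# Hypothetical tiny dataset: each element has shape (2, 2),
# row 0 = source, row 1 = target.
ds = tf.data.Dataset.from_tensor_slices([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])

def split_eagerly(x):
    # Runs eagerly inside tf.py_function, so .numpy() works here.
    arr = x.numpy()
    return arr[0], arr[1]

def prepare(x):
    # Wrap the eager function so it can run inside Dataset.map.
    src, trg = tf.py_function(split_eagerly, [x], [tf.int32, tf.int32])
    return src, trg

for src, trg in ds.map(prepare).take(1):
    print(src.numpy(), trg.numpy())  # [1 2] [3 4]
```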

Tensorflow version = 2.7


Solution

  • You could try splitting your samples like this:

    import tensorflow as tf
    import numpy as np
    
    
    data = [[np.array([2,539, 400, 513, 398, 523, 485, 533, 568, 566, 402, 565, 491,
       570, 576, 539, 351, 538, 297, 539, 262, 564, 313, 581, 370, 589,
       421, 514, 314, 501, 370, 489, 420,3]), np.array([2, 534, 403, 507, 401, 519, 487, 531, 567, 562, 405, 544, 495,
       537, 588, 528, 354, 526, 300, 534, 259, 555, 315, 575, 370, 589,
       421, 499, 315, 489, 372, 483, 423,3])]]
    
    samples = 50
    data = data * samples
    ds = tf.data.Dataset.from_tensor_slices(data)
    
    def prepare(x):
      srcs, trgs = tf.split(x, num_or_size_splits = 2, axis=1)
      return srcs,trgs
    
    def make_batches(ds):
        return (
            ds
            .cache()
            .shuffle(50)
            .batch(10)
            .map(prepare, num_parallel_calls=tf.data.experimental.AUTOTUNE)
            .prefetch(tf.data.experimental.AUTOTUNE))
    
    train_batches = make_batches(ds)
    for x, y in train_batches.take(1):
      print(x.shape, y.shape)
    
    (10, 1, 34) (10, 1, 34)
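If the singleton dimension left by `tf.split(..., axis=1)` is unwanted, it can be dropped with `tf.squeeze`; a sketch of the adjusted `prepare` (using a dummy zero batch just to show the shapes):

```python
import tensorflow as tf

def prepare(x):
    srcs, trgs = tf.split(x, num_or_size_splits=2, axis=1)
    # (batch, 1, 34) -> (batch, 34)
    return tf.squeeze(srcs, axis=1), tf.squeeze(trgs, axis=1)

batch = tf.zeros((10, 2, 34), dtype=tf.int64)
srcs, trgs = prepare(batch)
print(srcs.shape, trgs.shape)  # (10, 34) (10, 34)
```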