Tags: python, tensorflow, tensorflow2.0, tensorflow-datasets

How to merge two dict MapDataset to one MapDataset?


I'm having trouble merging several MapDatasets into one MapDataset. For example, one MapDataset is:

<MapDataset element_spec={'input_ids_task1': TensorSpec(), 'mask_task1': TensorSpec(), 'type_ids_task1': TensorSpec()}>

Another is:

<MapDataset element_spec={'input_ids_task2': TensorSpec(), 'mask_task2': TensorSpec(), 'type_ids_task2': TensorSpec()}>

I want to merge them into:

<MapDataset element_spec={'input_ids_task1': TensorSpec(), 'mask_task1': TensorSpec(), 'type_ids_task1': TensorSpec(), 'input_ids_task2': TensorSpec(), 'mask_task2': TensorSpec(), 'type_ids_task2': TensorSpec()}>

I've seen some answers that zip the two datasets with:

h = tf.data.Dataset.zip((a, b))

Then h would be a ZipDataset:

<ZipDataset element_spec=({'input_ids_task1': TensorSpec(), 'mask_task1': TensorSpec(), 'type_ids_task1': TensorSpec()}, {'input_ids_task2': TensorSpec(), 'mask_task2': TensorSpec(), 'type_ids_task2': TensorSpec()})>

because each element is now a tuple of two dicts.

I can get a MapDataset back, but with only the first dict, by:

h.map(lambda x,y: x)

However, I'm not sure how to merge them into one dict.

If that is not possible, could I change my input layers to accept a tuple of several dicts as the dataset input?


Solution

  • Not sure exactly what your data looks like, but you should be able to zip the two datasets and then merge each pair of dicts inside map with dict unpacking, something like this:

    import tensorflow as tf
    
    d1 = {
        'input_ids_task1': [[1, 2, 3], [1, 2, 2]],
        'mask_task1': [[1, 2, 3], [1, 2, 2]],
        'type_ids_task1': [[1, 2, 3], [1, 2, 2]] 
    }
    
    d2 = {
        'input_ids_task2': [[1, 2, 3], [1, 2, 2]],
        'mask_task2': [[1, 2, 3], [1, 2, 2]],
        'type_ids_task2': [[1, 2, 3], [1, 2, 2]] 
    }
    
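    # Each dataset yields one dict of tensors per element.
    dataset1 = tf.data.Dataset.from_tensor_slices(d1)
    dataset2 = tf.data.Dataset.from_tensor_slices(d2)
    
    # zip pairs the elements up as (dict1, dict2);
    # map then merges each pair into a single dict.
    h = tf.data.Dataset.zip((dataset1, dataset2))
    h = h.map(lambda x, y: {**x, **y})
    print(h)
    
    for d in h:
        print(d)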
    
    Output:
    
    <MapDataset element_spec={'input_ids_task1': TensorSpec(shape=(3,), dtype=tf.int32, name=None), 'mask_task1': TensorSpec(shape=(3,), dtype=tf.int32, name=None), 'type_ids_task1': TensorSpec(shape=(3,), dtype=tf.int32, name=None), 'input_ids_task2': TensorSpec(shape=(3,), dtype=tf.int32, name=None), 'mask_task2': TensorSpec(shape=(3,), dtype=tf.int32, name=None), 'type_ids_task2': TensorSpec(shape=(3,), dtype=tf.int32, name=None)}>
    {'input_ids_task1': <tf.Tensor: shape=(3,), dtype=int32, numpy=array([1, 2, 3], dtype=int32)>, 'mask_task1': <tf.Tensor: shape=(3,), dtype=int32, numpy=array([1, 2, 3], dtype=int32)>, 'type_ids_task1': <tf.Tensor: shape=(3,), dtype=int32, numpy=array([1, 2, 3], dtype=int32)>, 'input_ids_task2': <tf.Tensor: shape=(3,), dtype=int32, numpy=array([1, 2, 3], dtype=int32)>, 'mask_task2': <tf.Tensor: shape=(3,), dtype=int32, numpy=array([1, 2, 3], dtype=int32)>, 'type_ids_task2': <tf.Tensor: shape=(3,), dtype=int32, numpy=array([1, 2, 3], dtype=int32)>}
    {'input_ids_task1': <tf.Tensor: shape=(3,), dtype=int32, numpy=array([1, 2, 2], dtype=int32)>, 'mask_task1': <tf.Tensor: shape=(3,), dtype=int32, numpy=array([1, 2, 2], dtype=int32)>, 'type_ids_task1': <tf.Tensor: shape=(3,), dtype=int32, numpy=array([1, 2, 2], dtype=int32)>, 'input_ids_task2': <tf.Tensor: shape=(3,), dtype=int32, numpy=array([1, 2, 2], dtype=int32)>, 'mask_task2': <tf.Tensor: shape=(3,), dtype=int32, numpy=array([1, 2, 2], dtype=int32)>, 'type_ids_task2': <tf.Tensor: shape=(3,), dtype=int32, numpy=array([1, 2, 2], dtype=int32)>}
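One thing to keep in mind with the {**x, **y} merge above: if two dicts share a key, the later value silently replaces the earlier one. The sketch below is a plain-Python illustration (no TensorFlow needed) of a helper that merges any number of dicts and refuses duplicate keys. The name merge_feature_dicts and the sample values are hypothetical, not part of the original answer.

```python
def merge_feature_dicts(*dicts):
    """Merge any number of dicts into one, raising on duplicate keys.

    Hypothetical helper for illustration; unlike {**x, **y}, it does not
    let a later dict silently overwrite a key from an earlier one.
    """
    merged = {}
    for d in dicts:
        for key, value in d.items():
            if key in merged:
                raise ValueError(f"duplicate feature key: {key!r}")
            merged[key] = value
    return merged


# Plain-Python stand-ins for one element from each dataset (made-up values).
task1 = {"input_ids_task1": [1, 2, 3], "mask_task1": [1, 1, 1]}
task2 = {"input_ids_task2": [4, 5, 6], "mask_task2": [1, 1, 0]}

merged = merge_feature_dicts(task1, task2)
print(sorted(merged))
```

With a zipped dataset of several dict datasets, a helper like this could probably be passed to map as h.map(lambda *parts: merge_feature_dicts(*parts)). The key check would then run once, when TensorFlow traces the function, rather than per element. Treat that usage as an untested assumption.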