python, tensorflow, machine-learning

How to save a Dataset in multiple shards using `tf.data.Dataset.save`


How can I save a tf.data.Dataset in multiple shards using tf.data.Dataset.save()? I am reading in my dataset from CSV using tf.data.experimental.make_csv_dataset.

The TF docs here are not very helpful. There is a shard_func argument, but the examples given aren't helpful, and it's not clear how to map an element to an int in a deterministic way. Using random ints doesn't seem to work either.

The solution in a similar question here generates an error for me: TypeError: unsupported operand type(s) for %: 'collections.OrderedDict' and 'int'

Single Shard (works)

The code below successfully saves to a single shard.

import pandas as pd
import numpy as np
import tensorflow as tf

# gen data
n=10000
pd.DataFrame(
    {'label': np.random.randint(low=0, high=2, size=n),
     'f1': np.random.random(n),
     'f2': np.random.random(n),
     'f3': np.random.random(n),
     'c1': np.random.randint(low=0, high=n, size=n),
     'c2': np.random.randint(low=0, high=n, size=n)}
).to_csv('tmp.csv')
# load data into a tf.data.Dataset
data_ts = tf.data.experimental.make_csv_dataset(
        'tmp.csv', 1, label_name='label', num_epochs=1)
data_ts.save('tmp.data')  # single shard, works!
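
For reference, printing the element structure shows what a shard_func will receive: each element is a (features, label) pair where features is an OrderedDict of tensors, which is why arithmetic on the whole dictionary fails further down.

print(data_ts.element_spec)  # (OrderedDict of per-column TensorSpecs, label TensorSpec)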

Multiple shards using randint (saves a single shard)

Trying to save to multiple shards using a random number still only saves a single shard, albeit with a random int in the file name.

# Try sharding, using random numbers.
def random_shard_function(features, label):
    return np.int64(np.random.randint(10))
data_ts.save('tmp2.data', shard_func=random_shard_function)

image of filesystem

Modulo shard (error)

Trying the solution from this question.

def modulo_shard_function(features, label):
    return features & 10
data_ts.save('tmp2.data', shard_func=modulo_shard_function)

TypeError: unsupported operand type(s) for &: 'collections.OrderedDict' and 'int'

Debugging - no idea how shard_func works.

If I print out the inputs, it seems that the shard_func is only run once, and the tensors are SymbolicTensors:

def debug_shard_function(features, label):
    for val in features.items():
        print(f'{val=}')
    print(f'{label=}')
    print(f'{type(val[1])}')
    return np.int64(10)
data_ts.save('tmp2.data', shard_func=debug_shard_function)

Output (still saves to a single shard):

val=('', <tf.Tensor 'args_0:0' shape=(None,) dtype=int32>)
val=('f1', <tf.Tensor 'args_3:0' shape=(None,) dtype=float32>)
val=('f2', <tf.Tensor 'args_4:0' shape=(None,) dtype=float32>)
val=('f3', <tf.Tensor 'args_5:0' shape=(None,) dtype=float32>)
val=('c1', <tf.Tensor 'args_1:0' shape=(None,) dtype=int32>)
val=('c2', <tf.Tensor 'args_2:0' shape=(None,) dtype=int32>)
label=<tf.Tensor 'args_6:0' shape=(None,) dtype=int32>
<class 'tensorflow.python.framework.ops.SymbolicTensor'>
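
This matches how tf.function tracing works: the Python body runs once to build a graph, so Python-side values (like the np.random.randint above) are baked into the graph as constants, and only TensorFlow ops see the per-element values. A minimal sketch of the same one-time-trace behaviour, separate from the CSV data (the function name is just for illustration):

@tf.function
def traced(x):
    print('traced')          # Python-level print: runs only while the graph is built
    tf.print('executed', x)  # graph-level op: runs on every call
    return x

traced(tf.constant(1))
traced(tf.constant(2))  # 'traced' appears once, 'executed' appears twice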

Solution

  • shard_func must return a scalar Tensor of type tf.int64 (not a Python or NumPy integer), so you cannot just return np.int64(...) or apply a Python-level % to the features dictionary. You need to pick (or compute) a tensor inside the dataset element and return tf.cast(..., tf.int64). For example, if your CSV has a column "c1" you could do:

    def shard_func(features, label):
        return tf.cast(features['c1'][0] % 10, tf.int64)
    
    data_ts.save("my_data",shard_func=shard_func)
    

    This will produce up to 10 different shard files under the my_data directory, one per distinct value of c1 % 10.
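
    If no single column maps cleanly onto shard IDs, a deterministic hash of a feature value also works, since tf.strings.to_hash_bucket_fast returns a scalar tf.int64. A minimal sketch reusing the c1 column from the question (hash_shard_func and NUM_SHARDS are illustrative names, not part of the API):

    NUM_SHARDS = 10  # illustrative shard count

    def hash_shard_func(features, label):
        # Hash the string form of a feature value into NUM_SHARDS buckets;
        # to_hash_bucket_fast already returns a scalar tf.int64.
        key = tf.strings.as_string(features['c1'][0])
        return tf.strings.to_hash_bucket_fast(key, NUM_SHARDS)

    data_ts.save('my_data_hashed', shard_func=hash_shard_func)

    The saved directory can then be read back with tf.data.Dataset.load('my_data_hashed') (or tf.data.experimental.load on older TensorFlow versions).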