How can I save a tf.data.Dataset in multiple shards using tf.data.Dataset.save()? I am reading in my dataset from CSV using tf.data.experimental.make_csv_dataset.
The TF docs here are not very helpful. There is a shard_func argument, but the examples given aren't helpful and it's not clear how to map an element to an int in a deterministic way. Using random ints doesn't seem to work either.
The solution in a similar question here generates an error for me:
TypeError: unsupported operand type(s) for %: 'collections.OrderedDict' and 'int'
The below successfully saves to a single shard.
import pandas as pd
import numpy as np
import tensorflow as tf
# gen data
n = 10000
pd.DataFrame(
    {'label': np.random.randint(low=0, high=2, size=n),
     'f1': np.random.random(n),
     'f2': np.random.random(n),
     'f3': np.random.random(n),
     'c1': np.random.randint(n),
     'c2': np.random.randint(n)}
).to_csv('tmp.csv')
# load data into a tf.data.Dataset
data_ts = tf.data.experimental.make_csv_dataset(
    'tmp.csv', 1, label_name='label', num_epochs=1)
data_ts.save('tmp.data') # single shard, works!
Using randint (saves single shard)
Trying to save to multiple shards using a random number still saves only a single shard, albeit with a random int in the file name.
# Try sharding, using random numbers.
def random_shard_function(features, label):
    return np.int64(np.random.randint(10))
data_ts.save('tmp2.data', shard_func=random_shard_function)
Trying the solution from this question:
def modulo_shard_function(features, label):
    return x & 10
data_ts.save('tmp2.data', shard_func=modulo_shard_function)
TypeError: unsupported operand type(s) for &: 'collections.OrderedDict' and 'int'
If I print out the inputs, it seems that the shard func is only run once, and the tensors are SymbolicTensors:
def debug_shard_function(features, label):
    for val in features.items():
        print(f'{val=}')
    print(f'{label=}')
    print(f'{type(val[1])}')
    return np.int64(10)
data_ts.save('tmp2.data', shard_func=debug_shard_function)
This still saves to a single shard. Output:
val=('', <tf.Tensor 'args_0:0' shape=(None,) dtype=int32>)
val=('f1', <tf.Tensor 'args_3:0' shape=(None,) dtype=float32>)
val=('f2', <tf.Tensor 'args_4:0' shape=(None,) dtype=float32>)
val=('f3', <tf.Tensor 'args_5:0' shape=(None,) dtype=float32>)
val=('c1', <tf.Tensor 'args_1:0' shape=(None,) dtype=int32>)
val=('c2', <tf.Tensor 'args_2:0' shape=(None,) dtype=int32>)
label=<tf.Tensor 'args_6:0' shape=(None,) dtype=int32>
<class 'tensorflow.python.framework.ops.SymbolicTensor'>
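The single call and the symbolic inputs in the output above are consistent with tf.data tracing a Python function once into a graph, so any NumPy or Python value computed in the body is fixed at trace time. A minimal sketch of that effect, assuming TensorFlow 2.x and using Dataset.map on a toy range dataset as a stand-in for shard_func (the names python_random and tensor_random are made up for this illustration):

```python
import numpy as np
import tensorflow as tf

calls = []  # records each time the Python body actually executes


def python_random(x):
    calls.append(1)
    # np.random.randint runs while the function is being traced, so the
    # result is baked into the graph as a constant.
    return tf.constant(np.random.randint(10), dtype=tf.int64)


def tensor_random(x):
    # A TensorFlow op is part of the graph and is evaluated per element.
    return tf.random.uniform([], maxval=10, dtype=tf.int64)


ds = tf.data.Dataset.range(200)  # toy dataset, an assumption for the demo
python_values = {int(v) for v in ds.map(python_random).as_numpy_iterator()}
tensor_values = {int(v) for v in ds.map(tensor_random).as_numpy_iterator()}

print("Python body executed", len(calls), "time(s) for 200 elements")
print("distinct values from NumPy random:", len(python_values))
print("distinct values from tf.random:", len(tensor_values))
```

The NumPy version yields one distinct value for all 200 elements, which matches the single shard with a random int in its name.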
shard_func must return a scalar Tensor of type tf.int64 that is computed from the dataset element (not a Python or NumPy integer). Returning np.int64(...) gives the same constant for every element, so everything lands in one shard, and a Python-level % on the features dictionary fails with the TypeError you saw. You need to pick (or compute) a tensor inside the dataset element and return tf.cast(..., tf.int64). For example, if your CSV has a column "c1" you could do:
def shard_func(features, label):
    return tf.cast(features['c1'][0] % 10, tf.int64)
data_ts.save("my_data", shard_func=shard_func)
This will produce up to 10 different shards (files) under the my_data directory, one for each distinct value that shard_func returns.
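If no column is a good shard key, one deterministic alternative is to attach a running index with Dataset.enumerate() and shard on that index. This is a sketch rather than the method from the docs: it assumes TensorFlow 2.10 or later (where Dataset.save and Dataset.load are available), and NUM_SHARDS, the toy range dataset and the temporary path are placeholders chosen for the example.

```python
import os
import tempfile

import tensorflow as tf

NUM_SHARDS = 4  # hypothetical shard count

# Toy stand-in for the CSV dataset; enumerate() yields (index, element)
# pairs where index is a scalar tf.int64 tensor.
ds = tf.data.Dataset.range(20).enumerate()


def shard_func(index, value):
    # The modulo of an int64 tensor is still a scalar int64 tensor.
    return index % NUM_SHARDS


path = os.path.join(tempfile.mkdtemp(), "sharded")
ds.save(path, shard_func=shard_func)

# Load the dataset back and drop the helper index again.
loaded = tf.data.Dataset.load(path).map(lambda index, value: value)
values = sorted(int(v) for v in loaded.as_numpy_iterator())
print(values)
```

With the CSV dataset from the question, the enumerated element would be (index, (features, label)), so the function would take two arguments such as shard_func(index, element). The values are sorted here because this sketch does not rely on the order in which shards are read back.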