I am trying to manage a large image dataset that does not fit in memory, while also running some specific computations on it. Currently, my code looks like this:
files = [str(f) for f in self.files]
labels = self.categories
batch_size = 32
dataset = tf.data.Dataset.from_generator(
    lambda: zip(files, labels),
    output_types=(tf.string, tf.uint8),
    output_shapes=(tf.TensorShape([]), tf.TensorShape([]))
)
dataset = dataset.map(
    lambda x, y: tf.py_function(_parser, [x, y, category_count], [tf.float32, tf.uint8]),
    num_parallel_calls=tf.data.experimental.AUTOTUNE,
    deterministic=False)
dataset = dataset.cache(filename='/tmp/dataset.tmp')
if mode == tf.estimator.ModeKeys.TRAIN:
    dataset = dataset.shuffle(buffer_size=10 * batch_size, reshuffle_each_iteration=True)
dataset = dataset.batch(batch_size=batch_size, drop_remainder=False)
if mode == tf.estimator.ModeKeys.TRAIN:
    dataset = dataset.repeat(None)
else:
    dataset = dataset.repeat(1)
dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)
The _parser() function opens an image file, applies a number of transformations, and returns an image tensor together with a one-hot encoded label vector.
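For illustration, a minimal sketch of such a parser (hypothetical reconstruction; the actual transformations in my _parser differ):

# Hypothetical _parser sketch: decode, resize, normalize, one-hot encode.
def _parser(filename, label, category_count):
    data = tf.io.read_file(filename)
    img = tf.io.decode_image(data, channels=3, expand_animations=False)
    img = tf.image.resize(img, (224, 224)) / 255.0  # float32 in [0, 1]
    onehot = tf.one_hot(tf.cast(label, tf.int32), category_count, dtype=tf.uint8)
    return img, onehot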
The caching step does not seem to work properly, however. Does the cache() function create a file only when both the memory and the swap partition are full? Furthermore, I expect only batch_size files to be read at a time. However, it seems that all files are read at once during the mapping step. Should I consider using interleave() combined with from_generator() instead? Or should I batch the files first and only then map them, roughly as in the sketch below?
In general, caching does not have to consume a lot of RAM.
Unlike other libraries (such as gensim or Hugging Face), TensorFlow's basic caching is simplistic by design. As of now (checked under version 2.12), it uses RAM excessively and does not handle garbage collection well. The snippet below demonstrates that caching consumes RAM linearly with the data size and that the memory is not fully released in the second epoch:
import numpy as np
import tensorflow as tf
import psutil

IMG_SHAPE = (224, 224, 3)

# Endless generator of random "images" with random labels.
def gen_img(shape=IMG_SHAPE):
    while True:
        img = np.random.randint(0, 256, size=shape)
        lab = np.random.randint(0, 10)
        yield (img, lab)

ds = tf.data.Dataset.from_generator(
    gen_img,
    output_signature=(
        tf.TensorSpec(shape=IMG_SHAPE, dtype=tf.int32),
        tf.TensorSpec(shape=(), dtype=tf.int32)
    )
)

# !rm ./my_cached_dataset*
ds = ds.take(int(1e4)).cache('./my_cached_dataset').repeat(2)

# Report memory usage every 1000 examples (two epochs over 10k examples).
for i, (img, lab) in enumerate(ds):
    if i % 1000 == 0:
        print(psutil.virtual_memory())
This gives the following results on Google Colab:
svmem(total=13613314048, available=11903979520, percent=12.6, used=1375707136, free=9484337152, active=541478912, inactive=3288244224, buffers=42360832, cached=2710908928, shared=2764800, slab=214601728)
svmem(total=13613314048, available=11349929984, percent=16.6, used=1927380992, free=9098743808, active=538705920, inactive=3477790720, buffers=42557440, cached=2544631808, shared=2772992, slab=236298240)
svmem(total=13613314048, available=11246673920, percent=17.4, used=2030444544, free=8296189952, active=539701248, inactive=4435984384, buffers=43372544, cached=3243307008, shared=2772992, slab=266022912)
svmem(total=13613314048, available=10702491648, percent=21.4, used=2574770176, free=6455230464, active=543043584, inactive=6231724032, buffers=43421696, cached=4539891712, shared=2772992, slab=300105728)
svmem(total=13613314048, available=10379468800, percent=23.8, used=2897776640, free=4922003456, active=543133696, inactive=7728226304, buffers=43446272, cached=5750087680, shared=2772992, slab=334139392)
svmem(total=13613314048, available=10069651456, percent=26.0, used=3207753728, free=3356516352, active=543207424, inactive=9257857024, buffers=43511808, cached=7005532160, shared=2772992, slab=369360896)
svmem(total=13613314048, available=9731747840, percent=28.5, used=3545501696, free=1802670080, active=543256576, inactive=10778898432, buffers=43560960, cached=8221581312, shared=2772992, slab=403521536)
svmem(total=13613314048, available=9435697152, percent=30.7, used=3841613824, free=266637312, active=543305728, inactive=12278542336, buffers=43610112, cached=9461452800, shared=2772992, slab=438865920)
svmem(total=13613314048, available=9271164928, percent=31.9, used=4006137856, free=193994752, active=543870976, inactive=12340707328, buffers=43122688, cached=9370058752, shared=2772992, slab=440442880)
svmem(total=13613314048, available=8968581120, percent=34.1, used=4308578304, free=169992192, active=543911936, inactive=12344811520, buffers=42754048, cached=9091989504, shared=2772992, slab=435945472)
svmem(total=13613314048, available=8662331392, percent=36.4, used=4615012352, free=169848832, active=543952896, inactive=12350521344, buffers=42803200, cached=8785649664, shared=2772992, slab=428064768)
svmem(total=13613314048, available=9466744832, percent=30.5, used=3810525184, free=163422208, active=543965184, inactive=12362956800, buffers=42827776, cached=9596538880, shared=2772992, slab=416862208)
svmem(total=13613314048, available=9460772864, percent=30.5, used=3816542208, free=155451392, active=543985664, inactive=12395225088, buffers=42835968, cached=9598484480, shared=2772992, slab=382918656)
svmem(total=13613314048, available=9467899904, percent=30.5, used=3809370112, free=160645120, active=543797248, inactive=12427423744, buffers=42835968, cached=9600462848, shared=2772992, slab=349220864)
svmem(total=13613314048, available=9470406656, percent=30.4, used=3806834688, free=161198080, active=543805440, inactive=12460277760, buffers=42835968, cached=9602445312, shared=2772992, slab=315473920)
svmem(total=13613314048, available=9479843840, percent=30.4, used=3797512192, free=161202176, active=543797248, inactive=12491632640, buffers=42835968, cached=9611763712, shared=2772992, slab=291315712)
svmem(total=13613314048, available=9487978496, percent=30.3, used=3789242368, free=166912000, active=543797248, inactive=12523065344, buffers=42835968, cached=9614323712, shared=2772992, slab=262230016)
svmem(total=13613314048, available=9505796096, percent=30.2, used=3771478016, free=183492608, active=543797248, inactive=12555304960, buffers=42835968, cached=9615507456, shared=2772992, slab=229867520)
svmem(total=13613314048, available=9550958592, percent=29.8, used=3728154624, free=183087104, active=543797248, inactive=12567662592, buffers=42835968, cached=9659236352, shared=2772992, slab=215998464)
svmem(total=13613314048, available=9558626304, percent=29.8, used=3720568832, free=190627840, active=543797248, inactive=12567326720, buffers=42835968, cached=9659281408, shared=2772992, slab=215982080)
Note how percent climbs steadily from 12.6% to 36.4% while the cache is being written during the first epoch, and then only settles at around 30% in the second epoch instead of dropping back to the initial level.

In an attempt to make up for these shortcomings, TensorFlow has released a more advanced cache-like operation called snapshot. But as of now (July '23), it is experimental and poorly documented.
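If you want to try it nevertheless, here is a minimal usage sketch, assuming the Dataset.snapshot method available in recent versions (check the exact signature against your TensorFlow version; the path is arbitrary):

# Materialize the first pass to ./my_snapshot on disk; subsequent
# epochs are read back from the snapshot instead of the generator.
ds = ds.take(int(1e4)).snapshot('./my_snapshot', compression='AUTO').repeat(2)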