tensorflow2.0, tensorflow-datasets

No cache file written in TensorFlow dataset


I am trying to manage a large image dataset that does not fit in memory and requires some specific per-image computation. Currently, my code looks like this:

  files  = [str(f) for f in self.files]
  labels = self.categories
  batch_size = 32


  dataset = tf.data.Dataset.from_generator(
        lambda: zip(files, labels),
        output_types=(tf.string, tf.uint8),
        output_shapes=(tf.TensorShape([]), tf.TensorShape([]))
  )

  dataset = dataset.map(
        lambda x, y: tf.py_function(_parser, [x, y, category_count], [tf.float32, tf.uint8]),
        num_parallel_calls=tf.data.experimental.AUTOTUNE,
        deterministic=False)

  dataset.cache(filename='/tmp/dataset.tmp')

  if mode == tf.estimator.ModeKeys.TRAIN:
        dataset = dataset.shuffle(buffer_size=10*batch_size, reshuffle_each_iteration=True)

  dataset = dataset.batch(batch_size=batch_size, drop_remainder=False)

  if mode == tf.estimator.ModeKeys.TRAIN:
        dataset.repeat(None)
  else:
        dataset.repeat(1)

  dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)

The _parser() function opens an image file, applies a series of transformations, and returns an image tensor and a one-hot encoded label vector (a simplified sketch is shown at the end of this question). The caching step does not seem to work properly, however:

  • There is no significant improvement in computation time between the first epoch and the following ones
  • No cache file is created during the process, although the swap partition is almost full (~90%)

Does the cache() function create a file only when both the memory and the swap partition are full? Furthermore, I expect to read only batch_size files at a time; however, it seems that all files are read at once during the mapping step. Should I consider using interleave() combined with from_generator() instead? Or should I batch the files first, then map them?
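
For reference, here is a simplified sketch of what _parser() does (the real function applies more transformations; the decoding and resizing steps below, including the 224x224 size, are just placeholders):

  def _parser(filename, label, category_count):
        # Read and decode the image file.
        data  = tf.io.read_file(filename)
        image = tf.io.decode_image(data, channels=3, expand_animations=False)
        image = tf.image.convert_image_dtype(image, tf.float32)
        image = tf.image.resize(image, (224, 224))
        # One-hot encode the label, matching the tf.uint8 output type above.
        one_hot = tf.one_hot(label, category_count, dtype=tf.uint8)
        return image, one_hot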


Solution

  • In general, caching should not require lots of RAM.

    As opposed to other libraries (like gensim or Hugging Face), TensorFlow's basic caching is simplistic by design. As of now (checked under version 2.12), it uses RAM excessively and does not handle garbage collection well. This snippet demonstrates that caching consumes RAM linearly with the data size and does not free the resources in the second epoch:

    import numpy as np
    import tensorflow as tf
    import psutil
    
    IMG_SHAPE = (224, 224, 3)
    
    def gen_img(shape=IMG_SHAPE):
      # Yield an endless stream of random "images" and integer labels.
      while True:
        img = np.random.randint(0, 256, size=shape, dtype=np.int32)
        lab = np.random.randint(0, 10)
        yield (img, lab)
    
    ds = tf.data.Dataset.from_generator(
          gen_img,
          output_signature=(
            tf.TensorSpec(shape=IMG_SHAPE, dtype=tf.int32),
            tf.TensorSpec(shape=(), dtype=tf.int32)
          )
    )
    # Remove any stale cache files first (in Colab): !rm ./my_cached_dataset*
    ds = ds.take(int(1e4)).cache('./my_cached_dataset').repeat(2)
    
    # Print memory usage every 1000 elements while iterating over two epochs.
    for i, (img, lab) in enumerate(ds):
      if i % 1000 == 0:
        print(psutil.virtual_memory())
    

    This gives the following results on Google Colab:

    svmem(total=13613314048, available=11903979520, percent=12.6, used=1375707136, free=9484337152, active=541478912, inactive=3288244224, buffers=42360832, cached=2710908928, shared=2764800, slab=214601728)
    svmem(total=13613314048, available=11349929984, percent=16.6, used=1927380992, free=9098743808, active=538705920, inactive=3477790720, buffers=42557440, cached=2544631808, shared=2772992, slab=236298240)
    svmem(total=13613314048, available=11246673920, percent=17.4, used=2030444544, free=8296189952, active=539701248, inactive=4435984384, buffers=43372544, cached=3243307008, shared=2772992, slab=266022912)
    svmem(total=13613314048, available=10702491648, percent=21.4, used=2574770176, free=6455230464, active=543043584, inactive=6231724032, buffers=43421696, cached=4539891712, shared=2772992, slab=300105728)
    svmem(total=13613314048, available=10379468800, percent=23.8, used=2897776640, free=4922003456, active=543133696, inactive=7728226304, buffers=43446272, cached=5750087680, shared=2772992, slab=334139392)
    svmem(total=13613314048, available=10069651456, percent=26.0, used=3207753728, free=3356516352, active=543207424, inactive=9257857024, buffers=43511808, cached=7005532160, shared=2772992, slab=369360896)
    svmem(total=13613314048, available=9731747840, percent=28.5, used=3545501696, free=1802670080, active=543256576, inactive=10778898432, buffers=43560960, cached=8221581312, shared=2772992, slab=403521536)
    svmem(total=13613314048, available=9435697152, percent=30.7, used=3841613824, free=266637312, active=543305728, inactive=12278542336, buffers=43610112, cached=9461452800, shared=2772992, slab=438865920)
    svmem(total=13613314048, available=9271164928, percent=31.9, used=4006137856, free=193994752, active=543870976, inactive=12340707328, buffers=43122688, cached=9370058752, shared=2772992, slab=440442880)
    svmem(total=13613314048, available=8968581120, percent=34.1, used=4308578304, free=169992192, active=543911936, inactive=12344811520, buffers=42754048, cached=9091989504, shared=2772992, slab=435945472)
    svmem(total=13613314048, available=8662331392, percent=36.4, used=4615012352, free=169848832, active=543952896, inactive=12350521344, buffers=42803200, cached=8785649664, shared=2772992, slab=428064768)
    svmem(total=13613314048, available=9466744832, percent=30.5, used=3810525184, free=163422208, active=543965184, inactive=12362956800, buffers=42827776, cached=9596538880, shared=2772992, slab=416862208)
    svmem(total=13613314048, available=9460772864, percent=30.5, used=3816542208, free=155451392, active=543985664, inactive=12395225088, buffers=42835968, cached=9598484480, shared=2772992, slab=382918656)
    svmem(total=13613314048, available=9467899904, percent=30.5, used=3809370112, free=160645120, active=543797248, inactive=12427423744, buffers=42835968, cached=9600462848, shared=2772992, slab=349220864)
    svmem(total=13613314048, available=9470406656, percent=30.4, used=3806834688, free=161198080, active=543805440, inactive=12460277760, buffers=42835968, cached=9602445312, shared=2772992, slab=315473920)
    svmem(total=13613314048, available=9479843840, percent=30.4, used=3797512192, free=161202176, active=543797248, inactive=12491632640, buffers=42835968, cached=9611763712, shared=2772992, slab=291315712)
    svmem(total=13613314048, available=9487978496, percent=30.3, used=3789242368, free=166912000, active=543797248, inactive=12523065344, buffers=42835968, cached=9614323712, shared=2772992, slab=262230016)
    svmem(total=13613314048, available=9505796096, percent=30.2, used=3771478016, free=183492608, active=543797248, inactive=12555304960, buffers=42835968, cached=9615507456, shared=2772992, slab=229867520)
    svmem(total=13613314048, available=9550958592, percent=29.8, used=3728154624, free=183087104, active=543797248, inactive=12567662592, buffers=42835968, cached=9659236352, shared=2772992, slab=215998464)
    svmem(total=13613314048, available=9558626304, percent=29.8, used=3720568832, free=190627840, active=543797248, inactive=12567326720, buffers=42835968, cached=9659281408, shared=2772992, slab=215982080)
    
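    Listing the files that share the prefix passed to cache() confirms whether the file-based cache was actually written to disk (the exact shard names are an implementation detail, so the sketch below simply globs on the prefix):

    import glob, os
    
    # List whatever cache shards tf.data wrote for the './my_cached_dataset' prefix.
    for path in sorted(glob.glob('./my_cached_dataset*')):
      print(path, os.path.getsize(path), 'bytes')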

    In an attempt to make up for these shortcomings, TensorFlow has released a more advanced cache-like operation called snapshot. But as of now (July '23), it is experimental and poorly documented.
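
    A minimal sketch of how snapshot can be used in place of cache() is given below (assuming a recent TF release where Dataset.snapshot is available as a method; older versions expose it as tf.data.experimental.snapshot applied via dataset.apply(), and the path here is just a placeholder):

    import tensorflow as tf
    
    ds = tf.data.Dataset.range(10).map(lambda x: x * 2)
    
    # The first iteration materializes the mapped elements to disk; later
    # iterations (and later runs) read them back instead of recomputing.
    ds = ds.snapshot('/tmp/my_snapshot', compression='AUTO')
    
    for elem in ds.repeat(2):
      pass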