
(Conceptual question) TensorFlow Dataset... why use it?


I'm taking a MOOC on TensorFlow 2, and the assignments insist that we use TF datasets. However, all the hoops you have to jump through to do anything with a dataset seem to make everything far more difficult than using a Pandas DataFrame or a NumPy array. So why use it?


Solution

  • Pandas DataFrames and NumPy arrays are generally meant for small data, where the whole dataset fits in RAM (which is usually a few GB).

    In practice, datasets are often much bigger than that. One way of dealing with this is to keep the dataset on a hard drive, and there are other, more complicated storage solutions. A tf.data.Dataset lets you interface with these storage solutions easily. You create the Dataset object, which represents the stored data, in your script, and from then on you can train a model on it as usual. Behind the scenes, TF repeatedly reads a portion of the data into RAM, uses it, and then discards it.

    tf.data Datasets also provide many helpful methods for handling big data, such as prefetching (reading from storage and preprocessing the next examples while the model is still busy with the current ones), multithreading (running calculations like preprocessing on several examples at the same time), shuffling (which is harder when you can't just reorder the whole dataset in RAM each epoch), and batching (grouping several examples to feed to a model as one batch). All of this would be a pain to do yourself in an optimised way with Pandas or NumPy.
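
To make the points above a bit more concrete, here is a minimal sketch of such a pipeline. The three tiny CSV files, the parse_line helper, and the buffer and batch sizes are all made up for illustration; they stand in for a dataset that would really be too big for RAM. The tf.data calls themselves (TextLineDataset, shuffle, map, batch, prefetch) are the standard ones.

```python
import os
import tempfile

import tensorflow as tf

# Hypothetical toy data: three small CSV shards on disk, standing in for
# a dataset that would be too large to load into RAM in one go.
data_dir = tempfile.mkdtemp()
files = []
for shard in range(3):
    path = os.path.join(data_dir, f"shard_{shard}.csv")
    with open(path, "w") as f:
        for i in range(100):
            x = shard * 100 + i
            f.write(f"{x},{x % 2}\n")  # columns: feature, label
    files.append(path)


def parse_line(line):
    # Illustrative preprocessing: turn one CSV line into (feature, label).
    feature, label = tf.io.decode_csv(line, record_defaults=[0.0, 0])
    return feature / 300.0, label


dataset = (
    tf.data.TextLineDataset(files)    # reads lines lazily from the files
    .shuffle(buffer_size=64, seed=0)  # shuffles within a bounded buffer
    .map(parse_line, num_parallel_calls=tf.data.AUTOTUNE)  # parallel preprocessing
    .batch(32)                        # groups examples into batches
    .prefetch(tf.data.AUTOTUNE)       # prepares next batches in the background
)

# Iterating pulls batches through the pipeline one at a time; the full
# dataset is never held in memory as a single array.
num_examples = 0
batch_shapes = []
for features, labels in dataset:
    num_examples += int(features.shape[0])
    batch_shapes.append(tuple(features.shape))
```

A Keras model can take an object like this directly, e.g. model.fit(dataset), which is the "train a model on it as usual" part. Note that shuffle only mixes examples within its buffer, so with a buffer smaller than the dataset the shuffle is approximate; that is the trade-off for not needing everything in RAM.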