javascript, node.js, artificial-intelligence, tensorflow.js

How do I split a large Dataset in 2 for validation with Tensorflow.js?


I'm using Tensorflow.js in Node.js with an Nvidia CUDA-capable GPU (note that this is NOT Python), and I have implemented an AI model. I have a Dataset object that represents the input data I would like to train my model on.

However, I would like to do an 80% - 20% split on my data, with 80% used for training, and 20% used for validation.

In the .fitDataset() method, the validationData setting is present for specifying validation data.

Unfortunately though, I have just a single Dataset object that represents my entire dataset.

Additionally, my training data is both temporal and extremely large - and my Dataset object is backed by a generator function. Because of this, I'd like the last 20% of the Dataset object to act as my validation data.

What's the most efficient way to split a single Dataset object in 2 without loading it all into memory such that I can use the last 20% of it as validation data?


Solution

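  • The dataset is not loaded into memory all at once; it is read iteratively. A filter can therefore be applied as the data is read to split it into two datasets. Note that the filter below sends every fifth element to the test set rather than the last 20%.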

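    // assumes the @tensorflow/tfjs-node-gpu package (Node.js + CUDA, as in the question)
    const tf = require('@tensorflow/tfjs-node-gpu');

    // first load the dataset (csvUrl is the location of your CSV file)
    const csvDataset = tf.data.csv(csvUrl);

    // split dataset: every 5th element goes to the test set, the rest to training.
    // Filters run lazily, so each one needs its own counter; a shared counter
    // would be advanced by both datasets. The counters are not reset between
    // passes, so the split only stays stable if the element count is a multiple of 5.
    let trainIndex = 0;
    const trainDataset = csvDataset.filter(x => trainIndex++ % 5 !== 0);
    let testIndex = 0;
    const testDataset = csvDataset.filter(x => testIndex++ % 5 === 0);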
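
If you specifically need the last 20% as validation data (as the question asks), one possible approach is `take()` and `skip()`, which are both lazy `tf.data.Dataset` methods. The sketch below is not from the original answer and rests on one big assumption: you know (or can cheaply count) the total number of elements up front. `splitDataset` and `totalSize` are hypothetical names chosen for illustration.

```javascript
// Sketch only: split a dataset into a leading (training) part and a trailing
// (validation) part without loading it into memory.
// ASSUMPTION: totalSize (the number of elements) is known in advance.
// `dataset` is expected to be a tf.data.Dataset, but anything exposing
// take(n) and skip(n) with the same semantics will work.
function splitDataset(dataset, totalSize, validationFraction = 0.2) {
  const validationSize = Math.round(totalSize * validationFraction);
  const trainSize = totalSize - validationSize;
  return {
    trainDataset: dataset.take(trainSize),      // first trainSize elements
    validationDataset: dataset.skip(trainSize), // everything after them
    trainSize,
    validationSize,
  };
}

// Hypothetical usage with a generator-backed dataset of 1000 elements:
//   const { trainDataset, validationDataset } = splitDataset(myDataset, 1000);
//   await model.fitDataset(trainDataset.batch(32), {
//     epochs: 10,
//     validationData: validationDataset.batch(32),
//   });
```

Neither half is held in memory, but as far as I understand it `skip()` still has to advance the underlying generator past the first 80% on every validation pass, so it saves memory rather than generation time. If generating elements is expensive, it may be worth having the generator itself accept a start offset instead.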