
How to create training and test DataSetIterators in deeplearning4j?


I am building a recurrent neural network with deeplearning4j and I need to create the training and test data sets.

All the examples in the documentation and the example code use a CSVSequenceRecordReader to read CSV files.

A DataSetIterator is then created with the SequenceRecordReaderDataSetIterator constructor and passed to MultiLayerNetwork.fit() or MultiLayerNetwork.evaluate(), depending on whether it is a training or a test data set iterator.

However, in my case the data set is not stored in a CSV file. I access it online through a third-party library and pre-process it to obtain a List<Data> and a List<Labels>.

How can I:

1) create the DataSetIterator from my two lists?

2) split the DataSetIterator into a training set and a test set?

Edit:

I think my question is too broad. Let me try to narrow it down.

I have started to read this article which uses a very simple approach to create a data set:

It creates two INDArrays and builds a DataSet from them using the DataSet(INDArray first, INDArray second) constructor.

Training works with network.fit(dataSet);, but I can't evaluate the network while training, because the evaluate method requires a data set iterator, not a data set.

Moreover, from what I understand, using this approach also means that there is only one huge data set, no mini batches.

I guess I could also create mini batches from this big data set with the batchBy(int num) method, but it returns a list of data sets, not a data set iterator. iterateWithMiniBatches() does return a data set iterator, but when I looked at the source file, it returns null and is deprecated. Then I looked for an existing implementation of DataSetIterator I could use, but there are a lot of them. I tried BaseDataSetIterator, but its constructor does not take a DataSet; it takes a DataSetFetcher, which is yet another layer.

Is there an example somewhere that shows how to create a data set without using the default record readers? Or should I just write my own implementation of a record reader?


Solution

  • 1)

    MultiLayerNetwork.evaluate() takes a DataSetIterator, and ListDataSetIterator is one implementation you can pass to it.

    If you have a List<Data> object, you can first map it to a double[] featureVector and a double[] labelVector, and then create a ListDataSetIterator like this:

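        // Features: one row per example, numberOfFeatures columns, row-major ('c') order
        INDArray x = Nd4j.create(featureVector, new int[]{featureVector.length/numberOfFeatures, numberOfFeatures}, 'c');
        // Labels: one row per example, numberOfLabels columns
        INDArray y = Nd4j.create(labelVector, new int[]{labelVector.length/numberOfLabels, numberOfLabels}, 'c');

        final DataSet allData = new DataSet(x,y);

        // asList() yields one single-example DataSet per row
        final List<DataSet> list = allData.asList();

        ListDataSetIterator iterator = new ListDataSetIterator(list);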
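
The mapping from a list of examples to the flat featureVector (or labelVector) is not spelled out above, so here is a minimal sketch of one way to do it in plain Java. It assumes each Data object has already been converted to a double[] of equal length; the class name FlattenRows and the method flatten are hypothetical names for illustration and are not part of deeplearning4j.

```java
import java.util.Arrays;
import java.util.List;

public class FlattenRows {

    // Copies each example's values one after another (row-major order),
    // which matches the 'c' ordering passed to Nd4j.create above.
    static double[] flatten(List<double[]> rows, int width) {
        double[] flat = new double[rows.size() * width];
        for (int i = 0; i < rows.size(); i++) {
            double[] row = rows.get(i);
            if (row.length != width) {
                throw new IllegalArgumentException(
                        "row " + i + " has " + row.length + " values, expected " + width);
            }
            System.arraycopy(row, 0, flat, i * width, width);
        }
        return flat;
    }

    public static void main(String[] args) {
        // Hypothetical data: three examples with two features each.
        List<double[]> features = Arrays.asList(
                new double[]{1.0, 2.0},
                new double[]{3.0, 4.0},
                new double[]{5.0, 6.0});

        int numberOfFeatures = 2;
        double[] featureVector = flatten(features, numberOfFeatures);

        // prints [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
        System.out.println(Arrays.toString(featureVector));
        // Number of rows the resulting matrix would have; prints 3
        System.out.println(featureVector.length / numberOfFeatures);
    }
}
```

The same helper can be applied to the labels, with numberOfLabels as the width.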
    

    For 2), just create two separate iterators: one for training and one for testing.

    You can then evaluate your net with net.evaluate(testIterator);
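
One way to get the two iterators for 2) is to split the list of examples before wrapping each part in its own ListDataSetIterator. The sketch below shows only the splitting step in plain Java; it is generic, so it would work on the List<DataSet> produced by asList(). The names TrainTestSplit, Split and split are hypothetical, and the shuffle is an assumption that suits independent examples. If the order of your examples matters, you may want to drop it.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class TrainTestSplit {

    // Simple holder for the two parts of a split.
    static final class Split<T> {
        final List<T> train;
        final List<T> test;

        Split(List<T> train, List<T> test) {
            this.train = train;
            this.test = test;
        }
    }

    // Shuffles a copy of the examples with a fixed seed, then cuts it in two.
    static <T> Split<T> split(List<T> examples, double trainFraction, long seed) {
        if (trainFraction < 0.0 || trainFraction > 1.0) {
            throw new IllegalArgumentException("trainFraction must be between 0 and 1");
        }
        List<T> shuffled = new ArrayList<>(examples);
        Collections.shuffle(shuffled, new Random(seed));
        int cut = (int) Math.round(shuffled.size() * trainFraction);
        return new Split<>(
                new ArrayList<>(shuffled.subList(0, cut)),
                new ArrayList<>(shuffled.subList(cut, shuffled.size())));
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the list of single-example data sets.
        List<Integer> examples = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            examples.add(i);
        }

        Split<Integer> parts = split(examples, 0.8, 42L);
        System.out.println(parts.train.size()); // prints 8
        System.out.println(parts.test.size());  // prints 2
    }
}
```

Each of the two resulting lists can then be passed to its own ListDataSetIterator, giving one iterator for net.fit() and one for net.evaluate().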