Tags: deeplearning4j, dl4j

Failed to allocate [xxx] bytes from HOST memory


After fixing the "Error loading ND4J Compressors" issue (thank you Adam!), I get the following error: java.lang.RuntimeException: Failed to allocate 4735031021 bytes from HOST memory

(or java.lang.RuntimeException: cudaMalloc failed; Bytes: [4735031021]; Error code [2]; DEVICE [0])

17:31:16.143 [main] INFO org.nd4j.linalg.factory.Nd4jBackend - Loaded [JCublasBackend] backend
17:32:10.593 [main] INFO org.nd4j.nativeblas.NativeOpsHolder - Number of threads used for linear algebra: 32
17:32:10.625 [main] INFO org.nd4j.linalg.api.ops.executioner.DefaultOpExecutioner - Backend used: [CUDA]; OS: [Windows Server 2019]
17:32:10.625 [main] INFO org.nd4j.linalg.api.ops.executioner.DefaultOpExecutioner - Cores: [8]; Memory: [8,0GB];
17:32:10.625 [main] INFO org.nd4j.linalg.api.ops.executioner.DefaultOpExecutioner - Blas vendor: [CUBLAS]
17:32:10.657 [main] INFO org.nd4j.linalg.jcublas.JCublasBackend - ND4J CUDA build version: 11.6.55
17:32:10.657 [main] INFO org.nd4j.linalg.jcublas.JCublasBackend - CUDA device 0: [NVIDIA GeForce RTX 3090]; cc: [8.6]; Total memory: [25769279488]
17:32:10.657 [main] INFO org.nd4j.linalg.jcublas.JCublasBackend - Backend build information:
 MSVC: 192930146
STD version: 201402L
DEFAULT_ENGINE: samediff::ENGINE_CUDA
HAVE_FLATBUFFERS
HAVE_CUDNN
17:44:35.415 [main] INFO org.deeplearning4j.nn.multilayer.MultiLayerNetwork - Starting MultiLayerNetwork with WorkspaceModes set to [training: ENABLED; inference: ENABLED], cacheMode set to [NONE]
17:44:39.735 [main] INFO org.deeplearning4j.optimize.listeners.ScoreIterationListener - Score at iteration 0 is 7.222021991720728
Exception in thread "main" java.lang.RuntimeException: Failed to allocate 4735031021 bytes from HOST memory
        at org.nd4j.jita.memory.CudaMemoryManager.allocate(CudaMemoryManager.java:70)
        at org.nd4j.jita.workspace.CudaWorkspace.init(CudaWorkspace.java:88)
        at org.nd4j.linalg.api.memory.abstracts.Nd4jWorkspace.initializeWorkspace(Nd4jWorkspace.java:508)
        at org.nd4j.linalg.api.memory.abstracts.Nd4jWorkspace.close(Nd4jWorkspace.java:658)
        at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.calcBackpropGradients(MultiLayerNetwork.java:2040)
        at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.computeGradientAndScore(MultiLayerNetwork.java:2813)
        at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.computeGradientAndScore(MultiLayerNetwork.java:2756)
        at org.deeplearning4j.optimize.solvers.BaseOptimizer.gradientAndScore(BaseOptimizer.java:174)
        at org.deeplearning4j.optimize.solvers.StochasticGradientDescent.optimize(StochasticGradientDescent.java:61)
        at org.deeplearning4j.optimize.Solver.optimize(Solver.java:52)
        at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.fitHelper(MultiLayerNetwork.java:2357)
        at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.fit(MultiLayerNetwork.java:2315)
        at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.fit(MultiLayerNetwork.java:2378)
        at FAClassifierLearning.main(FAClassifierLearning.java:120)

It looks like the error comes from model.fit(allTrainingData) after the first iteration.

The error appears only when using the GPU; everything works correctly on the CPU.

When running, I tried passing the parameters -Xmx28g -Dorg.bytedeco.javacpp.maxbytes=30G, but with no success...
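
A minimal sketch, assuming JavaCPP's org.bytedeco.javacpp.Pointer class is on the classpath (it is pulled in by the ND4J backends), to verify at startup that those settings are actually being picked up:

import org.bytedeco.javacpp.Pointer;

public class MemoryLimitsCheck {
    public static void main(String[] args) {
        // Off-heap ("HOST") limit used by JavaCPP, configured via -Dorg.bytedeco.javacpp.maxbytes
        System.out.println("javacpp maxBytes:         " + Pointer.maxBytes());
        // Cap on total process memory, configured via -Dorg.bytedeco.javacpp.maxphysicalbytes
        System.out.println("javacpp maxPhysicalBytes: " + Pointer.maxPhysicalBytes());
        // JVM heap limit, configured via -Xmx
        System.out.println("JVM maxMemory:            " + Runtime.getRuntime().maxMemory());
    }
}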

My code

//First: get the dataset using the record reader. CSVRecordReader handles loading/parsing
int numLinesToSkip = 0;
char delimiter = ',';
RecordReader recordReader = new CSVRecordReader(numLinesToSkip,delimiter);
recordReader.initialize(new FileSplit(new File("vector.txt")));

//Second: the RecordReaderDataSetIterator handles conversion to DataSet objects, ready for use in neural network
int labelIndex = 5422;
int numClasses = 1170;
int batchSize = 4000;

DataSetIterator iterator = new RecordReaderDataSetIterator.Builder(recordReader, batchSize).classification(labelIndex, numClasses).build();

List<DataSet> trainingData = new ArrayList<>();
List<DataSet> testData = new ArrayList<>();

while (iterator.hasNext()) {
    DataSet allData = iterator.next();
    allData.shuffle();
    SplitTestAndTrain testAndTrain = allData.splitTestAndTrain(0.9);  // Use 90% of data for training
    trainingData.add(testAndTrain.getTrain());
    testData.add(testAndTrain.getTest());
}

DataSet allTrainingData = DataSet.merge(trainingData);
DataSet allTestData = DataSet.merge(testData);

//We need to normalize our data. We'll use NormalizerStandardize (which gives us mean 0, unit variance):
DataNormalization normalizer = new NormalizerStandardize();
normalizer.fit(allTrainingData);           // Collect the statistics (mean/stdev) from the training data. This does not modify the input data
normalizer.transform(allTrainingData);     // Apply normalization to the training data
normalizer.transform(allTestData);         // Apply normalization to the test data. This is using statistics calculated from the *training* set

long seed = 6;
int firstHiddenLayerSize = labelIndex/6;
int secondHiddenLayerSize = firstHiddenLayerSize/4;

MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
        .seed(seed)
        .activation(Activation.TANH)
        .weightInit(WeightInit.XAVIER)
        //.updater(new Sgd(0.1))
        .updater(Adam.builder().build())
        .l2(1e-4)
        .list()
        .layer(new DenseLayer.Builder().nIn(labelIndex).nOut(firstHiddenLayerSize)
                .build())
        .layer(new DenseLayer.Builder().nIn(firstHiddenLayerSize).nOut(secondHiddenLayerSize)
                .build())
        .layer( new OutputLayer.Builder(LossFunctions.LossFunction.NEGATIVELOGLIKELIHOOD)
                .activation(Activation.SOFTMAX) //Override the global TANH activation with softmax for this layer
                .nIn(secondHiddenLayerSize).nOut(numClasses).build())
        .build();

//run the model
MultiLayerNetwork model = new MultiLayerNetwork(conf);
model.init();

//record score once every 100 iterations
model.setListeners(new ScoreIterationListener(100));

for(int i=0; i<5000; i++) {
    model.fit(allTrainingData);
}

//evaluate the model on the test set
Evaluation eval = new Evaluation(numClasses);

INDArray output = model.output(allTestData.getFeatures());

eval.eval(allTestData.getLabels(), output);
log.info(eval.stats());

// Save the Model
File locationToSave = new File("trained-model.zip");
model.save(locationToSave, true);

// Save DataNormalization
NormalizerSerializer ns = NormalizerSerializer.getDefault();
ns.write(normalizer, new File("trained-normalizer.bin"));
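
For completeness, a sketch of how the saved model and normalizer could be restored later, assuming the same file names and DL4J version as above:

// Restore the network; 'true' also loads the updater state so training can be continued
MultiLayerNetwork restoredModel = MultiLayerNetwork.load(new File("trained-model.zip"), true);

// Restore the normalizer; apply it to new inputs before calling restoredModel.output(...)
NormalizerStandardize restoredNormalizer = NormalizerSerializer.getDefault().restore(new File("trained-normalizer.bin"));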

Updated code (fixing the error; only the changed parts are shown)

...
DataSetIterator iterator = new RecordReaderDataSetIterator.Builder(recordReader, batchSize).classification(labelIndex, numClasses).build();

List<DataSet> trainingData = new ArrayList<>();

while (iterator.hasNext()) {
    trainingData.add(iterator.next());
}

DataSet allTrainingData = DataSet.merge(trainingData);

// We need to normalize our data. We'll use NormalizerStandardize (which gives us mean 0, unit variance):
// Same as in the code above

// MultiLayerConfiguration conf...
// Same as in the code above
MultiLayerNetwork model = new MultiLayerNetwork(conf);
model.init();

List<DataSet> allTrainingDataBatched = allTrainingData.batchBy(10000);
for (int i=0; i<5000; i++) {
    for (DataSet dataSet: allTrainingDataBatched) {
        model.fit(dataSet);
    }
}
...

Solution

  • Your GPU cannot keep up with the memory demands of what you are running locally.

    HOST memory is your normal CPU RAM. GPU RAM is what is called device memory. These are separate address spaces, each with its own limits.

    If you are running on a smaller GPU there might not be much you can do.

    A few considerations: consider shrinking your batch size, and minimize allocations on the GPU by creating your datasets only when you are ready to use them (see the sketch at the end of this answer).

    Monitor your GPU RAM using whatever tools are available on your platform of choice, such as Windows Process Explorer or nvidia-smi.

    Feel free to post below and I can try to offer more specific advice on your particular GPU.
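
    As an illustration, here is a minimal sketch of that idea: instead of merging everything into one huge DataSet, feed the network straight from the iterator so only one mini-batch lives in memory at a time. It reuses the names from your question (recordReader, labelIndex, numClasses, conf); the batch size of 128 is just a placeholder to tune against your GPU.

    // Build a fresh iterator with a small batch size (128 is only an illustration)
    int smallBatchSize = 128;
    DataSetIterator trainIter = new RecordReaderDataSetIterator.Builder(recordReader, smallBatchSize)
            .classification(labelIndex, numClasses)
            .build();

    // Fit the normalizer from the iterator, then attach it so each mini-batch is normalized on the fly
    DataNormalization normalizer = new NormalizerStandardize();
    normalizer.fit(trainIter);
    trainIter.reset();
    trainIter.setPreProcessor(normalizer);

    // 'conf' is the same MultiLayerConfiguration built in the question
    MultiLayerNetwork model = new MultiLayerNetwork(conf);
    model.init();
    model.fit(trainIter, 50);   // fit(iterator, numEpochs) resets the iterator between epochs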