I am training a 3D U-Net to do multi-class (4 classes) semantic segmentation. Training with model.fit() runs just fine with no errors and I can see that the model is learning. However, when I try to run model.predict() I get the following error:
85/85 - 56s
2022-12-22 18:26:24.265485: F tensorflow/core/kernels/concat_lib_gpu_impl.cu.cc:165] Non-OK-status: GpuLaunchKernel( concat_variable_kernel<T, IntType, true>, config.block_count, config.thread_per_block, smem_usage, gpu_device.stream(), input_ptrs, output_scan, static_cast<IntType>(output->dimension(0)), static_cast<IntType>(output->dimension(1)), output->data()) status: Internal: invalid configuration argument
/cm/local/apps/slurm/var/spool/job5510720/slurm_script: line 14: 1945 Aborted
Here's a simplified and abbreviated version of my code:
import tensorflow as tf
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.optimizers import Adam, SGD
from tensorflow.keras.layers import Input, Conv3D, MaxPooling3D, Conv3DTranspose, UpSampling3D, Concatenate
def unet(input_shape, filters, kernel, model_name):
    strides_1 = (1, 1, 1)
    strides_2 = (2, 2, 2)
    ins = Input(shape=input_shape, name='input_1')
    # Encoding
    #--------------------------
    encode1a = Conv3D(filters=filters, kernel_size=kernel, activation='relu', padding='same', name='encode1a', strides=strides_1)(ins)
    encode1b = Conv3D(filters=filters, kernel_size=kernel, activation='relu', padding='same', name='encode1b', strides=strides_1)(encode1a)
    pool1 = MaxPooling3D(pool_size=(2, 2, 2), padding='same', name='pool1')(encode1b)
    encode2a = Conv3D(filters=2*filters, kernel_size=kernel, activation='relu', padding='same', name='encode2a', strides=strides_1)(pool1)
    encode2b = Conv3D(filters=2*filters, kernel_size=kernel, activation='relu', padding='same', name='encode2b', strides=strides_1)(encode2a)
    pool2 = MaxPooling3D(pool_size=(2, 2, 2), padding='same', name='pool2')(encode2b)
    encode3a = Conv3D(filters=4*filters, kernel_size=kernel, activation='relu', padding='same', name='encode3a', strides=strides_1)(pool2)
    encode3b = Conv3D(filters=4*filters, kernel_size=kernel, activation='relu', padding='same', name='encode3b', strides=strides_1)(encode3a)
    pool3 = MaxPooling3D(pool_size=(2, 2, 2), padding='same', name='pool3')(encode3b)
    # Bottleneck
    #--------------------------
    bottom_a = Conv3D(filters=8*filters, kernel_size=kernel, activation='relu', padding='same')(pool3)
    bottom_b = Conv3D(filters=8*filters, kernel_size=kernel, activation='relu', padding='same')(bottom_a)
    # Decoding
    #--------------------------
    up2 = Concatenate(axis=4)([Conv3DTranspose(filters=4*filters, kernel_size=(2, 2, 2), strides=strides_2, padding='same')(bottom_b), encode3b])
    decode2a = Conv3D(filters=4*filters, kernel_size=kernel, activation='relu', padding='same', name='decode2a')(up2)
    decode2b = Conv3D(filters=4*filters, kernel_size=kernel, activation='relu', padding='same', name='decode2b')(decode2a)
    up3 = Concatenate(axis=4)([Conv3DTranspose(filters=2*filters, kernel_size=(2, 2, 2), strides=strides_2, padding='same')(decode2b), encode2b])
    decode1a = Conv3D(filters=2*filters, kernel_size=kernel, activation='relu', padding='same', name='decode1a')(up3)
    decode1b = Conv3D(filters=2*filters, kernel_size=kernel, activation='relu', padding='same', name='decode1b')(decode1a)
    up4 = Concatenate(axis=4)([Conv3DTranspose(filters=filters, kernel_size=(2, 2, 2), strides=strides_2, padding='same')(decode1b), encode1b])
    decode0a = Conv3D(filters=filters, kernel_size=kernel, activation='relu', padding='same', name='decode0a')(up4)
    decode0b = Conv3D(filters=filters, kernel_size=kernel, activation='relu', padding='same', name='decode0b')(decode0a)
    # Output
    #--------------------------
    out = Conv3D(filters=4, kernel_size=(1, 1, 1), activation='softmax')(decode0b)
    model = Model(inputs=ins, outputs=out, name=model_name)
    return model
FILTERS = 32
KERNEL = (3,3,3)
MODEL_NAME = 'multi-unet-test'
LR = 3e-3
strategy = tf.distribute.MirroredStrategy()
print('Number of devices: {}'.format(strategy.num_replicas_in_sync))
with strategy.scope():
    model = unet((None, None, None, 1), FILTERS, KERNEL, model_name=MODEL_NAME)
    model.compile(optimizer=Adam(learning_rate=LR),
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(),
                  metrics=['accuracy'])
    model.summary()
# load_dataset_all() is a function for loading the input and mask fields;
# it outputs shapes of [256,128,128,128,4]
X_train, Y_train = load_dataset_all(FILE_DEN, FILE_MSK, SUBGRID)
history = model.fit(X_train, Y_train, batch_size = 4, epochs = 50, verbose = 2, shuffle = True, validation_split = 0.2)
model.save(MODEL_NAME)
# Load and predict
# this is actually in another script but I'm putting this all in one go:
model = load_model(MODEL_NAME)
model.compile(loss=model.loss, optimizer=model.optimizer, metrics=['accuracy'])
# load test data:
X_test = load_dataset()
Y_test = model.predict(X_test, batch_size = 4, verbose = 2)
After some Googling and reading other questions on Stack Overflow, two suggestions come up repeatedly: adjust the batch size so that the number of samples is divisible by it, and switch to different versions of TF/CUDA. Originally my X_test had a shape of [343,128,128,128,4], but I chopped off 3 samples to get it to [340,128,128,128,4] so that it's divisible by my batch size of 4.
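For reference, the chop itself is trivial (a minimal sketch, with my batch size of 4 hard-coded):

batch_size = 4
# keep the largest multiple of batch_size worth of samples (343 -> 340)
n_keep = (X_test.shape[0] // batch_size) * batch_size
X_test = X_test[:n_keep]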
My first test used TF 2.4.1 with CUDA 11.6. I then tried the same code on Colab with TF 2.9.2 and CUDA 11.2 and got the same error, so I doubt the versions are the problem.
Any advice or help would be greatly appreciated. Let me know if there's any other information I can provide.
Thank you!!!
I had the exact same problem and it's now gone. I changed a few things, and at some point the error message changed to "Split on GPU requires input size < max int32", so I'm not exactly sure what the original problem was. I just wanted to give you a list of the things I changed; maybe one of them helps:
Generally, I couldn't and still can't make sense of the error message ("invalid configuration argument"), but I suspect it's a memory problem. My model is even smaller than yours, but our arrays are huge (my inputs are 128x128x128 and my labels 512x512x512).
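For what it's worth, the numbers in your post are at least consistent with the memory idea: 340 samples x 128^3 voxels x 4 classes is about 2.85 billion output elements, which already exceeds the int32 maximum (2,147,483,647) that the "Split on GPU" message refers to. One thing you could try (just a sketch, untested on your setup, and predict_in_chunks is a name I made up) is predicting in smaller chunks and concatenating on the host with NumPy:

import numpy as np

# Run predict() on small slices so no single device-side tensor
# approaches the 2**31 - 1 element limit, then stitch on the CPU.
def predict_in_chunks(model, X, chunk=8, batch_size=4):
    parts = []
    for i in range(0, X.shape[0], chunk):
        parts.append(model.predict(X[i:i + chunk], batch_size=batch_size, verbose=0))
    return np.concatenate(parts, axis=0)

Y_test = predict_in_chunks(model, X_test)

That keeps each call's output around chunk x 128^3 x 4 elements instead of materializing the whole test set at once.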
Hope that helps at least a bit.