Search code examples
ccudastack-overflowunspecified-behavior

CUDA: "Stack Overflow or Breakpoint Hit" and unspecified launch failure error after copying char array from host to device


I have a large char array in my main program that I copy in chunks to the device memory. I run about 500,000 threads in my program and each thread accesses 2000 chars. So I transfer 500,000 * 2000 = 1GB bytes at a time with the code

err = cudaMemcpy (dev_database, adjusted_database[k], JOBS * 2000 * sizeof(char), cudaMemcpyHostToDevice);
if(err != cudaSuccess) { printf("CUDA error: %s\n", cudaGetErrorString(err)); exit(EXIT_FAILURE); }

In my kernel I also define three shared arrays

//__shared__ char dev_query[200];
__shared__ float dev_scores[200*5];
__shared__ int dev_index[26];

and initialize them with

if(threadIdx.x == 0) { 
  //for(i = 0; i < 200; i++){ dev_query[i] = dev_query_constant[i]; }
  for(i = 0; i < 200 * 5; i++){ dev_scores[i] = dev_scores_constant[i]; }
  for(i = 0; i < 26; i++){ dev_index[i] = dev_index_constant[i]; }
}
__syncthreads(); 

If I run my program with the two lines commented my kernel returns strange values and when I copy the second chunk of the char array I get the error

CUDA error: unspecified launch failure

If I uncomment the lines in the code above everything works fine. If I copy smaller chunks of the array, such as 100MB instead of 1GB, its works fine until I get to the 6th chunk where I get the same errors as above.

This is very strange behaviour and I'd like to understand why is this happening. Is there a bug somewhere that is causing this? It's hard to pinpoint it because the program works fine if I transfer a small chunk (such as 100MB) and ignore the other ones. It also works fine if I uncomment the lines related to the shared variables or change the shared variables to constant. Any help would be greatly appreciated. Thanks!

EDIT: Here is my kernel. To summarize, I'm computing the similarity score of two strings by comparing their ith character for all i between 0 and their lengths. This code below will produce the above error unless you uncomment the line immediately after if(threadIdx.x == 0) {. Or if you replace the shared arrays below with constant ones then it also works fine.

__global__ void assign7(int jobs_todo, char* database, float* results, int flag) {
unsigned int id = threadIdx.x + blockIdx.x * blockDim.x;

if(id < jobs_todo) {
__shared__ char dev_query[200];
__shared__ float dev_pos_specific_scores[200*5];
__shared__ int dev_subst_index[26];

int j_, i, p, stop, k; //stop2;
float score=0, max=0;
char ch; //ch1, ch2;

if(threadIdx.x == 0) {
//for(i = 0; i < 51; i++){ dev_query[i] = dev_query_constant[i]; }
  for(i = 0; i < 5 * 200; i++){ dev_pos_specific_scores[i] = dev_pos_specific_scores_constant[i]; }
  for(i = 0; i < 26; i++){ dev_subst_index[i] = dev_subst_index_constant[i]; }
}
__syncthreads(); 

for(i = 1; i <= 2000 - 51; i += 1){
  p = jobs_todo*(i-1);
  score = 0;
  stop = 51/1; stop = stop*1;
  for(j_ = 1; j_ <= stop; j_ += 1){
    k = (j_-1)*5;
    ch = database[p + id];
    score += dev_pos_specific_scores[k + dev_subst_index[ch - 'A']];
    if(score < 0) score = 0;
    if(score > max) max = score;                                      
    p += jobs_todo;
  }
}
results[id] = max;
}
}

Solution

  • There were some characters in the data that caused dev_index[ch-'A'] to return -1. This made the index of dev_scores -1 when k = 0. I believe that was the source of the memory error in my code. I commented everything and incrementally uncommented portions. It works fine now. Thanks @talonmies, @harrism, and @perreal for your comments!