I have a large char array in my main program that I copy in chunks to the device memory. I run about 500,000 threads in my program and each thread accesses 2000 chars. So I transfer 500,000 * 2000 = 1GB bytes at a time with the code
err = cudaMemcpy (dev_database, adjusted_database[k], JOBS * 2000 * sizeof(char), cudaMemcpyHostToDevice);
if(err != cudaSuccess) { printf("CUDA error: %s\n", cudaGetErrorString(err)); exit(EXIT_FAILURE); }
In my kernel I also define three shared arrays
//__shared__ char dev_query[200];
__shared__ float dev_scores[200*5];
__shared__ int dev_index[26];
and initialize them with
if(threadIdx.x == 0) {
//for(i = 0; i < 200; i++){ dev_query[i] = dev_query_constant[i]; }
for(i = 0; i < 200 * 5; i++){ dev_scores[i] = dev_scores_constant[i]; }
for(i = 0; i < 26; i++){ dev_index[i] = dev_index_constant[i]; }
}
__syncthreads();
If I run my program with the two lines commented my kernel returns strange values and when I copy the second chunk of the char array I get the error
CUDA error: unspecified launch failure
If I uncomment the lines in the code above everything works fine. If I copy smaller chunks of the array, such as 100MB instead of 1GB, its works fine until I get to the 6th chunk where I get the same errors as above.
This is very strange behaviour and I'd like to understand why is this happening. Is there a bug somewhere that is causing this? It's hard to pinpoint it because the program works fine if I transfer a small chunk (such as 100MB) and ignore the other ones. It also works fine if I uncomment the lines related to the shared variables or change the shared variables to constant. Any help would be greatly appreciated. Thanks!
EDIT:
Here is my kernel. To summarize, I'm computing the similarity score of two
strings by comparing their ith character for all i between 0 and their lengths.
This code below will produce the above error
unless you uncomment the line immediately after if(threadIdx.x == 0) {
. Or if you
replace the shared arrays below with constant ones then it also works fine.
__global__ void assign7(int jobs_todo, char* database, float* results, int flag) {
unsigned int id = threadIdx.x + blockIdx.x * blockDim.x;
if(id < jobs_todo) {
__shared__ char dev_query[200];
__shared__ float dev_pos_specific_scores[200*5];
__shared__ int dev_subst_index[26];
int j_, i, p, stop, k; //stop2;
float score=0, max=0;
char ch; //ch1, ch2;
if(threadIdx.x == 0) {
//for(i = 0; i < 51; i++){ dev_query[i] = dev_query_constant[i]; }
for(i = 0; i < 5 * 200; i++){ dev_pos_specific_scores[i] = dev_pos_specific_scores_constant[i]; }
for(i = 0; i < 26; i++){ dev_subst_index[i] = dev_subst_index_constant[i]; }
}
__syncthreads();
for(i = 1; i <= 2000 - 51; i += 1){
p = jobs_todo*(i-1);
score = 0;
stop = 51/1; stop = stop*1;
for(j_ = 1; j_ <= stop; j_ += 1){
k = (j_-1)*5;
ch = database[p + id];
score += dev_pos_specific_scores[k + dev_subst_index[ch - 'A']];
if(score < 0) score = 0;
if(score > max) max = score;
p += jobs_todo;
}
}
results[id] = max;
}
}
There were some characters in the data that caused dev_index[ch-'A']
to return -1. This
made the index of dev_scores
-1 when k = 0. I believe that was the source of the
memory error in my code. I commented everything and incrementally uncommented portions.
It works fine now. Thanks @talonmies, @harrism, and @perreal for your comments!