I have been trying to parallelize some code of mine using PyCUDA. I need to launch 10^5 threads, with each thread running around 4000 iterations. This should be well within the block and grid limits of my GPU (grid = (98,1,1), block = (1024,1,1)). However, executing the program gives me the following error: "cuLaunchKernel failed: too many resources requested for launch"
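For reference, the grid size just comes from rounding the thread count up to a whole number of blocks (threads, block_x and grid_x here are only illustrative names):

threads = 10**5
block_x = 1024
grid_x = threads // block_x + 1   # 98 blocks, i.e. 98 * 1024 = 100352 threads

(Strictly, that launches 352 more threads than there are array elements, and the kernel below doesn't guard against running off the end of the arrays, but that is a separate issue from the error here.)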
Here's the code (please don't worry about the CUDA kernel functions; I have tested them separately in a .cu file and they work fine):
import numpy as np
import matplotlib.pyplot as plt
import pycuda.driver as cuda
import pycuda.gpuarray as gpuarray
from pycuda.compiler import SourceModule
import pycuda.autoinit
mod = SourceModule("""
#include <math.h>

// Advance the coupled map n steps, updating x and y in place.
__device__ void iterate(double r, double *x, double *y, int n){
    for(int i = 0; i < n; i++){
        *x = r * (3 * *y + 1) * *x * (1 - *x);
        *y = r * (3 * *x + 1) * *y * (1 - *y);
    }
}

// One thread per parameter value r: estimate the Lyapunov exponent from the
// divergence of two trajectories that start a distance d0 apart.
__global__ void calc_lyap(double* arr, double* lyap, int n){
    // flatten the 3D grid/block coordinates into one global index
    int blocknum = blockIdx.z * (gridDim.x * gridDim.y) + blockIdx.y * gridDim.x + blockIdx.x;
    int threadnum = threadIdx.z * (blockDim.x * blockDim.y) + threadIdx.y * blockDim.x + threadIdx.x;
    int index = blocknum * (blockDim.x * blockDim.y * blockDim.z) + threadnum;
    double d0 = 1e-12;             // initial separation of the trajectory pair
    double r = arr[index];
    double x1 = 0.1, y1 = 0.1;
    iterate(r, &x1, &y1, 1000);    // discard the transient
    double x2 = x1, y2 = y1 + d0;  // partner trajectory, offset by d0
    double sum = 0;
    for(int i = 0; i < n; i++){
        iterate(r, &x1, &y1, 1);
        iterate(r, &x2, &y2, 1);
        double d1 = sqrt(pow((x1 - x2), 2) + pow((y1 - y2), 2));
        if(d1 != 0){
            sum += log2(d1 / d0);
        }
        // renormalize the partner back to distance d0 along the current offset
        x2 = x1 + d0 * (x2 - x1) / d1;
        y2 = y1 + d0 * (y2 - y1) / d1;
    }
    lyap[index] = sum / n;
}
""")
lyap = mod.get_function("calc_lyap")
arr_d = gpuarray.to_gpu(np.linspace(0.4,1.2,10**5))
lyap_d = gpuarray.to_gpu(np.zeros(10**5))
n = gpuarray.to_gpu(np.array([3000]))
lyap(arr_d,lyap_d,n[0],grid=(10**5//1024+1,1,1),block=(1024,1,1))
lyap_ = lyap_d.get()
print(lyap_)
I tried reducing the size of the problem to just a sample, i.e. I changed 10^5 to just 10 and the block and grid dimensions to grid=(1,1,1) and block=(10,1,1), but it still yields the same error.
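In code, the scaled-down test looked roughly like this (same kernel, just fewer elements):

arr_d = gpuarray.to_gpu(np.linspace(0.4, 1.2, 10))
lyap_d = gpuarray.to_gpu(np.zeros(10))
lyap(arr_d, lyap_d, n[0], grid=(1,1,1), block=(10,1,1))  # still fails with the same error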
Python version: 3.10.8
PyCUDA version: 2022.2.2
Compiler version: nvcc 11.8.89
OS: Windows
GPU: NVIDIA RTX 3050 Laptop GPU
Going through the CUDA documentation I found the entry for "CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES = 701", which mentions that this error occurs not only when you pass too many arguments, but also when your arguments are of the wrong type, e.g. passing int64 values where you have declared "int" in C, which is typically 32 bits. My mistake lay in the initialization of the variable n that I pass as a parameter:
n = gpuarray.to_gpu(np.array([3000]))
The first mistake was that np.array infers its dtype from the contents, so np.array([3000]) gets a platform-default integer type (usually int64) rather than the 32-bit int the kernel expects. However, the error still persisted when I changed the code to:
n = gpuarray.to_gpu(np.array([3000]).astype(np.int32))
However, it finally worked when I initialized n as:
n = np.int32(3000)
lyap(arr_d,lyap_d,n,grid=(10**5//1024+1,1,1),block=(1024,1,1))
So my mistake was that I passed a parameter of the wrong type, although I don't understand why it wouldn't work when I initialized it as an int32 array and passed the index-0 value as the parameter. I'm assuming it has something to do with how PyCUDA and numpy store their array elements.
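A quick check supports this: indexing a gpuarray does not give back a host scalar at all, but another gpuarray view that still lives on the device (this is my reading from poking at the objects, not from PyCUDA's internals):

import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray

n = gpuarray.to_gpu(np.array([3000]).astype(np.int32))
print(type(n[0]))        # <class 'pycuda.gpuarray.GPUArray'> -- still on the device
print(type(n.get()[0]))  # <class 'numpy.int32'> -- an actual 4-byte host scalar

So passing n[0] presumably hands the kernel an 8-byte device pointer where it expects a 4-byte int, regardless of the array's dtype, whereas np.int32(3000) (or n.get()[0]) is marshalled as a real 32-bit value. One way to catch this class of mistake earlier is PyCUDA's prepared-call interface, which pins the argument types down explicitly; a sketch, where "PPi" means pointer, pointer, 32-bit int:

lyap.prepare("PPi")
lyap.prepared_call((10**5//1024 + 1, 1, 1), (1024, 1, 1),
                   arr_d.gpudata, lyap_d.gpudata, np.int32(3000))

With a prepared call the arrays are passed as raw .gpudata pointers, and a wrongly typed scalar should then fail loudly on the host side rather than as a cryptic launch error.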