How many threads can an NVIDIA GPU launch?

How many threads can an NVIDIA GTX 1050 4GB GPU launch? For example, kernel<<<1,32>>>(args); launches 32 threads. So what is the maximum number of threads possible? I am aware of this post: "how many threads does nvidia GTS 450 has".

Solution

This question often arises from a misunderstanding of GPU execution behavior. However, the limits that apply to the kind of kernel launch you gave as an example:

For example: kernel<<<1,32>>>(args); can launch 32 threads.

do not vary across GPUs supported by recent CUDA toolkits (i.e., GPUs of compute capability 3.0 through 8.6, which includes your GTX 1050). These limits are given in the documentation and can also be queried at runtime (for example with cudaGetDeviceProperties), as demonstrated by the deviceQuery sample code.

A threadblock (the second kernel launch configuration parameter) is limited to 1024 threads in total, counted as the product of its 3 dimensions, x, y, z, and each of those dimensions also has an individual limit:

                    x      y     z
threadblock      1024   1024    64
       grid    2^31-1  65535 65535

and likewise, as indicated above, the grid (the first kernel launch configuration parameter) has individual limits on each dimension, but no limit on the product.
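
To make the interaction between the per-dimension limits and the 1024-thread product limit concrete, here is a minimal host-side sketch in plain C++. It is only an illustration: `Dim3` is a hypothetical stand-in for CUDA's `dim3` (so the code compiles without the CUDA headers), the function names are my own, and the limits are hard-coded from the table above rather than queried from the device.

```cpp
#include <cstdint>

// Hypothetical stand-in for CUDA's dim3, so this compiles without CUDA headers.
struct Dim3 {
    std::uint64_t x, y, z;
};

// Limits copied from the table above (compute capability 3.0 through 8.6).
constexpr std::uint64_t kMaxBlockX = 1024, kMaxBlockY = 1024, kMaxBlockZ = 64;
constexpr std::uint64_t kMaxThreadsPerBlock = 1024;
constexpr std::uint64_t kMaxGridX = 2147483647ULL;  // 2^31-1
constexpr std::uint64_t kMaxGridY = 65535, kMaxGridZ = 65535;

// A block must respect each per-dimension limit AND the 1024-thread product limit.
bool block_ok(const Dim3& b) {
    if (b.x == 0 || b.y == 0 || b.z == 0) return false;
    if (b.x > kMaxBlockX || b.y > kMaxBlockY || b.z > kMaxBlockZ) return false;
    return b.x * b.y * b.z <= kMaxThreadsPerBlock;
}

// A grid only has per-dimension limits; there is no limit on the product.
bool grid_ok(const Dim3& g) {
    if (g.x == 0 || g.y == 0 || g.z == 0) return false;
    return g.x <= kMaxGridX && g.y <= kMaxGridY && g.z <= kMaxGridZ;
}

// True if kernel<<<grid, block>>> would pass these size checks.
bool launch_ok(const Dim3& grid, const Dim3& block) {
    return grid_ok(grid) && block_ok(block);
}
```

For instance, `block_ok({32, 32, 1})` passes (1024 threads), while `block_ok({1024, 2, 1})` fails even though each dimension is individually within range, because the product is 2048.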

The maximum number of threads that can currently be specified in a single kernel launch is therefore the product of the three grid dimension limits and the threadblock limit:

total = (2^31-1)*65535*65535*1024

That product is 9,444,444,733,164,249,676,800 (roughly 9.4 x 10^21).
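
The product does not fit in a 64-bit integer (whose maximum is about 1.8 x 10^19), so checking it takes a little care. The following is a small sketch in standard C++ that multiplies the four limits using schoolbook decimal multiplication on a string. The helper names `mul_decimal` and `max_launchable_threads` are made up for this illustration.

```cpp
#include <algorithm>
#include <cstdint>
#include <initializer_list>
#include <string>

// Multiply a non-negative decimal string by a 32-bit factor (schoolbook method).
std::string mul_decimal(const std::string& num, std::uint32_t factor) {
    std::string out;
    std::uint64_t carry = 0;
    // Walk the digits from least significant to most significant.
    for (auto it = num.rbegin(); it != num.rend(); ++it) {
        std::uint64_t cur = static_cast<std::uint64_t>(*it - '0') * factor + carry;
        out.push_back(static_cast<char>('0' + cur % 10));
        carry = cur / 10;
    }
    // Flush whatever is left in the carry.
    while (carry > 0) {
        out.push_back(static_cast<char>('0' + carry % 10));
        carry /= 10;
    }
    std::reverse(out.begin(), out.end());
    return out;
}

// (2^31-1) * 65535 * 65535 * 1024, returned as a decimal string.
std::string max_launchable_threads() {
    std::string total = "1";
    for (std::uint32_t f : {2147483647u, 65535u, 65535u, 1024u}) {
        total = mul_decimal(total, f);
    }
    return total;
}
```

Calling `max_launchable_threads()` returns the string "9444444733164249676800", which agrees with the figure above.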

Note that on most GPUs, a kernel launch this large, even with an empty kernel, will take a very long time to complete. (*)

The documentation covers thread hierarchy as well as how to specify multi-dimensional grids and threadblocks.

(*) For amusement, an empty kernel launch of <<<dim3(1,65535,65535),1024>>> takes about 1 minute to process on a GTX 960. Since the grid x dimension can be up to 2^31-1 times larger than that, a "maximal" empty kernel launch would take about 2^31 minutes to process (over 4000 years) on that GPU.
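
The back-of-the-envelope estimate in that footnote can be written out as follows. This is just arithmetic, assuming the launch time scales linearly with the grid x dimension (an assumption, not a measurement). The function name is hypothetical.

```cpp
// Hypothetical helper: scale a measured launch time up to the maximal grid.
// minutes_for_unit_x is the time measured for a grid of dim3(1,65535,65535)
// with 1024-thread blocks (about 1 minute on the GTX 960 mentioned above).
// Assumes the time grows linearly with the grid x dimension.
double maximal_launch_years(double minutes_for_unit_x) {
    const double max_grid_x = 2147483647.0;                 // 2^31-1
    const double minutes_per_year = 60.0 * 24.0 * 365.25;   // 525,960
    return max_grid_x * minutes_for_unit_x / minutes_per_year;
}
```

With an input of 1.0 minute this comes out a little under 4083 years, consistent with "over 4000 years".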