I'd like to create an integer array of 100 elements and another one of roughly 10-100 integers (the size varies by user input) on every thread. I will reuse the data in the array_views several times on a thread, so I want to copy the content of the array_view into thread-local data to improve memory access time. (Every thread is responsible for its "own" 100 elements of the array_view; creating one thread for every element is not possible with my algorithm.) If that is not possible, tile_static memory will do the trick too, but thread-local memory would be better.
My question is: how many bytes can I allocate on a thread as a local variable/array (a minimum amount that will work on most GPUs)? Also, which software can I use to query the capabilities of my GPU (number of registers per thread, size of static memory per tile, etc.)? The CUDA SDK has a utility app that queries the capabilities of the GPU, but I have an AMD one, a Radeon HD 5770, and it won't work with my GPU if I am correct.
The OpenCL API can query GPU or CPU devices for the capabilities available to OpenCL programs, and the results should be similar for any other natively optimized framework. But if your C++ AMP implementation is based on HLSL or similar, you may not be able to use LDS.
32 kB of LDS and 24 kB of constant cache per compute unit means you can have 1 kB of LDS + 0.75 kB of constant cache per thread when you choose 32 threads per compute unit. But drivers may use some of it for other purposes, so you can always test different sizes. Also look at the constant cache bandwidth: it is 2x the LDS bandwidth.
If you are using those arrays without sharing them with other threads (or without any synchronization), you can use the 256 kB register space per compute unit (8 kB per thread with a setup of 32 threads per CU), which has six times the bandwidth of LDS. But there are always some limits, so the actual usable value may be half of this.
Figures taken from Appendix D of http://developer.amd.com/wordpress/media/2013/07/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide-rev-2.7.pdf
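
To make the per-thread arithmetic concrete, here is a minimal host-side C++ sketch. It only divides the per-compute-unit sizes quoted above evenly among the resident threads and compares the result with the two arrays from the question (100 integers plus up to 100 more). The constant and function names are made up for illustration, and the sizes are the quoted figures, not values queried from the hardware; real drivers may reserve part of each resource.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Per-compute-unit sizes as quoted in the answer (assumed, not queried).
constexpr std::size_t kLdsBytesPerCU      = 32 * 1024;   // local data share
constexpr std::size_t kConstBytesPerCU    = 24 * 1024;   // constant cache
constexpr std::size_t kRegisterBytesPerCU = 256 * 1024;  // register space

// Even share of a per-compute-unit resource among the resident threads.
constexpr std::size_t bytes_per_thread(std::size_t bytes_per_cu,
                                       std::size_t threads_per_cu) {
    return bytes_per_cu / threads_per_cu;
}

// Bytes one thread needs for its two arrays of 32-bit integers.
constexpr std::size_t bytes_needed(std::size_t fixed_ints,
                                   std::size_t variable_ints) {
    return (fixed_ints + variable_ints) * sizeof(std::int32_t);
}

// True if the requested size does not exceed the per-thread budget.
constexpr bool fits(std::size_t needed, std::size_t budget) {
    return needed <= budget;
}

// Worst case from the question: 100 + 100 integers = 800 bytes per thread.
static_assert(bytes_needed(100, 100) == 800, "two arrays of 100 ints");
// With 32 threads per compute unit this fits in the 1 kB LDS share.
static_assert(fits(bytes_needed(100, 100),
                   bytes_per_thread(kLdsBytesPerCU, 32)),
              "800 bytes fit into 1024 bytes of LDS per thread");
```

Under these assumptions the worst case of 800 bytes fits into the 1 kB LDS share and into the 8 kB register share (or 4 kB if only half is usable) at 32 threads per compute unit. If 64 threads share a compute unit, the LDS share drops to 512 bytes and the two arrays no longer fit there.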