
Read cudaArray in device code


Is there a way to read the values in a cudaArray from the device without wrapping it in a texture reference/object? All of the examples I've looked at use cudaArrays exclusively for creating textures. Is that the only way they can be used, or could I do something like:

__global__ void kernel(cudaArray *arr, ...) {
    float x = tex1D<float>(arr, ...);
    ...
}

cudaArray *arr;
cudaMallocArray(&arr, ...);
cudaMemcpyToArray(arr, ...);
kernel<<<...>>>(arr, ...);

So basically, what should go in place of tex1D there? Also, if this is possible I'd be curious if anyone thinks there would be any performance benefit to doing this, but I'll also be running my own tests to see.

Thanks!


Solution

  • cudaArray is designed for texture or surface memory use. As the CUDA C Programming Guide indicates:

    CUDA arrays are opaque memory layouts optimized for texture fetching. They are one-dimensional, two-dimensional, or three-dimensional and composed of elements, each of which has 1, 2, or 4 components that may be signed or unsigned 8-, 16-, or 32-bit integers, 16-bit floats, or 32-bit floats. CUDA arrays are only accessible by kernels through texture fetching as described in Texture Memory or surface reading and writing as described in Surface Memory.

    So in effect you have to use either texture functions or surface functions in kernels to access data in a cudaArray.
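    As a concrete sketch of what that means for your code: instead of passing the `cudaArray*` to the kernel, you wrap it in a `cudaTextureObject_t` and read it with `tex1D`. The example below is minimal and illustrative (error checking omitted; `n`, the kernel name, and the launch configuration are arbitrary choices, and `cudaMemcpyToArray` is kept to match your snippet even though newer toolkits prefer `cudaMemcpy2DToArray`):

    ```cuda
    #include <cuda_runtime.h>

    __global__ void kernel(cudaTextureObject_t tex, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = tex1D<float>(tex, (float)i);  // fetch texel i through the texture unit
    }

    int main() {
        const int n = 256;
        float h[n];
        for (int i = 0; i < n; ++i) h[i] = (float)i;

        // Allocate the cudaArray and copy the data into it
        cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
        cudaArray *arr;
        cudaMallocArray(&arr, &desc, n);
        cudaMemcpyToArray(arr, 0, 0, h, n * sizeof(float), cudaMemcpyHostToDevice);

        // Describe the resource (the cudaArray) and how it should be sampled
        cudaResourceDesc resDesc = {};
        resDesc.resType = cudaResourceTypeArray;
        resDesc.res.array.array = arr;

        cudaTextureDesc texDesc = {};
        texDesc.readMode = cudaReadModeElementType;   // return raw float values
        texDesc.filterMode = cudaFilterModePoint;     // no interpolation
        texDesc.addressMode[0] = cudaAddressModeClamp;

        cudaTextureObject_t tex = 0;
        cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);

        float *d_out;
        cudaMalloc(&d_out, n * sizeof(float));
        kernel<<<(n + 63) / 64, 64>>>(tex, d_out, n);
        cudaDeviceSynchronize();

        cudaDestroyTextureObject(tex);
        cudaFree(d_out);
        cudaFreeArray(arr);
        return 0;
    }
    ```

    The key point is that the kernel parameter is the texture object, not the `cudaArray*` itself; the array is only ever touched through the texture (or surface) machinery.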

    There are several potential performance benefits associated with texturing. One is interpolation: reading from a texture with floating-point coordinates can return a hardware-interpolated value. Any application that needs this kind of data interpolation can benefit from the interpolation engines built into the GPU's texture units.
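    A short sketch of that interpolation, assuming a texture object created over a 1D float cudaArray with `cudaFilterModeLinear` and unnormalized coordinates (the kernel name and `scale` parameter are just for illustration):

    ```cuda
    // Assumes `tex` was created with texDesc.filterMode = cudaFilterModeLinear
    // over a 1D float cudaArray, using unnormalized coordinates.
    __global__ void resample(cudaTextureObject_t tex, float *out, int n, float scale) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            // A fractional coordinate falls between two texels; the texture
            // unit returns their weighted average, computed in hardware.
            out[i] = tex1D<float>(tex, i * scale + 0.5f);
    }
    ```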

    Another benefit, perhaps the most important for using texturing in arbitrary GPU code, is the texture cache that backs texture fetches from global memory. Texturing is a read-only operation, so for an array of read-only data the texture cache gives you an additional, separate caching path that can speed up access. This generally requires data locality/reuse in the code that reads through the texture mechanism. Data fetched through textures does not displace anything in the L1 cache, so this kind of data segregation is usually part of a larger data-caching strategy. If there were no other demands on the L1 cache, the texture mechanism/cache would not provide faster access than data already resident in L1.