Need help with 2D textures on CUDA: not getting the expected values for the indexes I want

I have two 2-dimensional matrices (A and B). I think I can speed things up a bit if, instead of keeping them in global memory and accessing them through pointers, I put them into 2D textures and read them from there. The matrices aren't that big, and different positions are read by different threads.

Right now my code uses global memory and the values I get are correct. For every (i, j) in the matrix I multiply the following and add up the products:

A[i][j] * B[ p[i] ] [ p[j] ]

The optimal value for the instance I'm testing is 9552, and I should not get a different value.

So I moved to textures, and it seems some fetches return a wrong value, because I now get 9511.

I read up on textures in CUDA and saw that they are indexed over [0..n-1]. But they also have a normalized access mode and a few other options, such as filtering, where the value you get back is an interpolation of neighboring texels.

What are the default options for a texture? Maybe that's the problem. I couldn't find the defaults in the Programming Guide.

Here's the relevant code:

Declaration:

texture<float,2> A_matrix;
texture<float,2> B_matrix;

Allocation:

HANDLE_ERROR( cudaMalloc( (void**)&_A, n * n * sizeof(float) ) );
HANDLE_ERROR( cudaMalloc( (void**)&_B, n * n * sizeof(float) ) );

Memcpy:

HANDLE_ERROR( cudaMemcpy( _A, A, n * n * sizeof(float), cudaMemcpyHostToDevice ) );
HANDLE_ERROR( cudaMemcpy( _B, B, n * n * sizeof(float), cudaMemcpyHostToDevice ) );

Binding and descriptors (I created two descriptors, which is unnecessary):

cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
cudaChannelFormatDesc desc2 = cudaCreateChannelDesc<float>();

HANDLE_ERROR( cudaBindTexture2D( NULL, A_matrix,
                                 _A,
                                 desc, n, n,
                                 sizeof(float) * n ) );
HANDLE_ERROR( cudaBindTexture2D( NULL, B_matrix,
                                 _B,
                                 desc2, n, n,
                                 sizeof(float) * n ) );

And where I use it:

res += tex2D(A_matrix, i, j)  * tex2D(B_matrix, p[i], p[j]);

So how can I use textures correctly? Or is this how they are meant to behave?

EDIT:

This is the code that performs this memory access. The commented-out line doesn't use textures and works perfectly.

__device__ inline float datastruct::getPermutationValue(int* p)
{
    float res = 0;
    for(int i = 0 ; i < ints[data_n] ; i++)
        for(int j = 0 ; j < ints[data_n] ; j++)
            res += tex2D(A_matrix, i, j)  * tex2D(B_matrix, p[i], p[j]);
            //res += qap_A[i * ints[data_n] + j]  * qap_B[p[i] * ints[data_n] + p[j]];
    return res;
}

Solution

Sorry, 2D textures in CUDA can be a pain. I have a simple example that's only about 150 lines, which I posted on the NVIDIA forums. I believe you may need one of the SDK libraries to compile it, but otherwise it should compile for you. I haven't had much luck with textures for more complicated operations. I hope this helps.
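
One thing I would rule out first, as a rough sketch rather than a diagnosis: tex2D takes x (the column, the fastest-varying index) before y (the row), so tex2D(A_matrix, i, j) reads element [j][i] of a row-major array, not [i][j]. For this particular double sum the swap cancels out, because both fetches are transposed and the loops cover every (i, j) pair, so it should not by itself explain 9511 vs 9552. The host-only C++ below emulates the fetch to check that. The names fetch2D, costGlobal and costFetch are made up for illustration, and the emulation assumes unnormalized coordinates, point filtering (no interpolation) and a row pitch of exactly n floats.

```cpp
#include <vector>

// Hypothetical host-side stand-in for tex2D on a texture bound to row-major
// linear memory: x selects the column, y selects the row.
// Assumption: unnormalized coordinates, point filtering, pitch == width.
static float fetch2D(const std::vector<float>& data, int width, int x, int y) {
    return data[y * width + x];
}

// Reference: same indexing as the commented-out global-memory line.
static float costGlobal(const std::vector<float>& A, const std::vector<float>& B,
                        const std::vector<int>& p, int n) {
    float res = 0.0f;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            res += A[i * n + j] * B[p[i] * n + p[j]];
    return res;
}

// Same loops using emulated fetches.
// rowAsY == false mimics tex2D(A_matrix, i, j): i is passed as x (column).
// rowAsY == true  passes the row index as y, i.e. the tex2D(A_matrix, j, i) form.
static float costFetch(const std::vector<float>& A, const std::vector<float>& B,
                       const std::vector<int>& p, int n, bool rowAsY) {
    float res = 0.0f;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            float a = rowAsY ? fetch2D(A, n, j, i) : fetch2D(A, n, i, j);
            float b = rowAsY ? fetch2D(B, n, p[j], p[i]) : fetch2D(B, n, p[i], p[j]);
            res += a * b;
        }
    return res;
}
```

Calling costFetch with either value of rowAsY on small non-symmetric matrices gives the same total as costGlobal, even though an individual fetch2D(A, n, i, j) returns the transposed element. If the totals agree in this emulation but not on the device, the remaining suspects are on the binding side (for example the pitch passed to cudaBindTexture2D) rather than in the loop itself.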