Tags: c, optimization, nvcc

nvcc compiler not optimizing


Why doesn't the compiler perform some trivial optimizations that could be done in the kernel? I have the following code for matrix multiplication:

__global__ void matrixMultiply(float * A, float * B, float * C,
        int numARows, int numAColumns,
        int numBRows, int numBColumns,
        int numCRows, int numCColumns) {

    int n=numAColumns;
    int Row=blockIdx.x*blockDim.x+threadIdx.x;
    int Col=blockIdx.y*blockDim.y+threadIdx.y;
    if((Row<numCRows) && (Col<numCColumns)){
        for(int k=0;k<n;++k){
            C[Row*numCColumns+Col]+=
                A[Row*numAColumns+k]*B[k*numBColumns+Col];
        }
    }   
}

The example runs much faster if I use a temporary register Cvalue to accumulate the sum:

__global__ void matrixMultiply(float * A, float * B, float * C,
        int numARows, int numAColumns,
        int numBRows, int numBColumns,
        int numCRows, int numCColumns) {

    int n=numAColumns;
    int Row=blockIdx.x*blockDim.x+threadIdx.x;
    int Col=blockIdx.y*blockDim.y+threadIdx.y;
    if((Row<numCRows) && (Col<numCColumns)){
        float Cvalue=0;
        for(int k=0;k<n;++k){
            Cvalue+=A[Row*numAColumns+k]*B[k*numBColumns+Col];
        }
        C[Row*numCColumns+Col]=Cvalue;
    }   
}

In the second case, global memory for C is accessed only once, whereas in the first case it is accessed on every loop iteration. Isn't this kind of optimization normally done by compilers? The two versions differ by about 30% in performance in my tests, and I'm compiling with nvcc -O3 ...


Solution

  • Since C is not declared as __restrict__, the compiler cannot know whether C is the same matrix as A or B, so it cannot perform the optimization mentioned. When I instead used float* __restrict__ C, the times for both versions became practically the same (see the sketch below). Thanks Chris Dodd.
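
For reference, here is a minimal sketch of the first kernel with only that change applied; everything else is the code from the question, and only the qualifier on C is new:

// Same naive kernel, but C is declared __restrict__, so the compiler may
// assume the stores to C do not alias the loads from A or B and is free to
// keep the running sum in a register across the loop.
__global__ void matrixMultiply(float * A, float * B, float * __restrict__ C,
        int numARows, int numAColumns,
        int numBRows, int numBColumns,
        int numCRows, int numCColumns) {

    int n=numAColumns;
    int Row=blockIdx.x*blockDim.x+threadIdx.x;
    int Col=blockIdx.y*blockDim.y+threadIdx.y;
    if((Row<numCRows) && (Col<numCColumns)){
        for(int k=0;k<n;++k){
            C[Row*numCColumns+Col]+=
                A[Row*numAColumns+k]*B[k*numBColumns+Col];
        }
    }
}

With that single qualifier the compiler has the same freedom the manual Cvalue version grants explicitly, which matches the observation that both versions then perform practically the same. (Qualifying A and B as const ... __restrict__ as well is a common further step, though the answer only mentions C.)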