Tags: c, cuda, pycuda, cublas

CUDA program gives cudaErrorIllegalAddress on sm_35 Kepler GPUs, but runs fine on other GPUs


I'm having a very weird problem with my program. Essentially I'm doing a matrix multiplication on part of a matrix. The program apparently runs fine on most cards, but crashes on sm_35 Kepler (GK110) cards.

The initial program was written in PyCUDA, but I've since managed to boil it down to the following minimal example written in C:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(int argc, char **argv)
{
    cudaError_t status;
    cublasStatus_t status_blas;
    CUresult status_drv;
    float *A = 0;
    float *B = 0;
    float *C = 0;
    float alpha = 1.0f;
    float beta = 0.0f;
    float *oldA, *oldB, *oldC;
    cublasHandle_t handle;
    int n = 131;
    int m = 2483;
    int k = 3;
    int i;
    CUcontext ctx;

    cuInit(0);
    status_drv = cuCtxCreate(&ctx, 0, 0);
    if (status_drv != CUDA_SUCCESS) {
        fprintf(stderr, "!!!! Context creation error: %d\n", status);
        return EXIT_FAILURE;
    }
    status_blas = cublasCreate(&handle);
    if (status_blas != CUBLAS_STATUS_SUCCESS) {
        fprintf(stderr, "!!!! CUBLAS initialization error\n");
        return EXIT_FAILURE;
    }

    for (i = 0; i < 5; ++i) {
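        /* Allocate fresh device buffers on every pass; the buffers from
           pass 0 are only freed on pass 1 (see below), mimicking the
           behavior of the original PyCUDA program. */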
        printf("Iteration %d\n", i);
        if (cudaMalloc((void **)&B, m * k * sizeof(B[0])) != cudaSuccess) {
            fprintf(stderr, "!!!! allocation error (allocate B)\n");
            return EXIT_FAILURE;
        }
        if (cudaMalloc((void **)&C, m * m * sizeof(C[0])) != cudaSuccess) {
            fprintf(stderr, "!!!! allocation error (allocate C)\n");
            return EXIT_FAILURE;
        }
        if (cudaMalloc((void **)&A, n * m * sizeof(A[0])) != cudaSuccess) {
            fprintf(stderr, "!!!! allocation error (allocate A)\n");
            return EXIT_FAILURE;
        }
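        /* Treat A as an m x 131 column-major matrix and take its last
           3 columns as A_slice; the gemm below computes the m x m
           product C = A_slice (m x 3) * B (3 x m). */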
        int s = 3;
        float * A_slice = A + 128*m;
        status_blas = cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, m, s,
            &alpha, A_slice, m, B, k, &beta, C, m);
        if (status_blas != CUBLAS_STATUS_SUCCESS) {
            fprintf(stderr, "!!!! kernel execution error.\n");
            return EXIT_FAILURE;
        }
        if (i == 0) {
            oldA = A;
            oldB = B;
            oldC = C;
        } else if (i == 1) {
            status = cudaFree(oldA);
            if (status != cudaSuccess) {
                fprintf(stderr, "!!!! allocation error (free A, %d)\n", status);
                return EXIT_FAILURE;
            }
            if (cudaFree(oldB) != cudaSuccess) {
                fprintf(stderr, "!!!! allocation error (free B)\n");
                return EXIT_FAILURE;
            }
            if (cudaFree(oldC) != cudaSuccess) {
                fprintf(stderr, "!!!! allocation error (free C)\n");
                return EXIT_FAILURE;
            }
        }
    }
    cublasDestroy(handle);
    cuCtxDestroy(ctx);
    return 0;
}

I only free the memory in the 2nd iteration of the for loop to mimic the behavior of the original Python program. The program crashes in the 2nd iteration of the for loop when it tries to free A, with cudaFree returning a cudaErrorIllegalAddress error.
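Note that device-side faults are usually raised asynchronously and only surfaced by a later API call, so the cudaFree call may just be the first call to report an error produced earlier. One way to narrow this down (a diagnostic sketch only, not part of the reproducer above; it reuses the existing status variable) is to synchronize right after the cublasSgemm call:

        /* Diagnostic only: force the asynchronous gemm kernel to complete,
           so any device-side fault is reported here rather than by the
           later cudaFree call. */
        status = cudaDeviceSynchronize();
        if (status != cudaSuccess) {
            fprintf(stderr, "!!!! fault after cublasSgemm: %s\n",
                    cudaGetErrorString(status));
            return EXIT_FAILURE;
        }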

Concretely, this was tried on the following cards:

  • NVS 5400M -> no issues
  • GTX560Ti -> no issues
  • Tesla S2050 -> no issues
  • unknown sm_30 card (see comments to this post) -> no issues
  • K40c -> CRASH
  • GTX 780 -> CRASH
  • K20m -> CRASH

I used a number of Linux machines with different distributions, some of them using CUDA 5.5 and some using CUDA 6.0. At least on the machines I have direct control over, all cards were using the 331 NVIDIA driver series.

There are several things to note here:

  • The order of the malloc calls matters: if I allocate A before B, things run fine.
  • The numerical constants matter somewhat: for some values (e.g. n=30) no crash occurs, for others there is one.
  • The order of the free/malloc calls matters: if I free the memory in the same iteration where I allocate it, everything works just fine (see the sketch after this list).
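For illustration, here is a sketch of the variant described in the last bullet (everything else exactly as in the program above); with this ordering the program runs fine even on the affected cards:

    for (i = 0; i < 5; ++i) {
        /* cudaMalloc calls and the cublasSgemm call exactly as above */

        /* Free A, B and C in the same iteration in which they were
           allocated instead of one iteration later. */
        if (cudaFree(A) != cudaSuccess ||
            cudaFree(B) != cudaSuccess ||
            cudaFree(C) != cudaSuccess) {
            fprintf(stderr, "!!!! error freeing A/B/C\n");
            return EXIT_FAILURE;
        }
    }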

At this point I'm pretty desperate. I don't see why or where I'm doing anything wrong. If anyone could help me, I'd really appreciate it.

EDIT: as pointed out in the comments, apparently it only fails to run on sm_35 (i.e., GK110 cards), but runs fine on sm_30 Kepler cards.
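For anyone trying to reproduce this, the compute capability a device reports can be checked with the standard runtime API; a small illustration (the prop and dev names are just local variables for this snippet):

    cudaDeviceProp prop;
    int dev = 0;
    cudaGetDevice(&dev);
    cudaGetDeviceProperties(&prop, dev);
    /* sm_35 cards report major == 3 and minor == 5 */
    printf("%s: sm_%d%d\n", prop.name, prop.major, prop.minor);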


Solution

  • This issue should be fixed in the CUDA 6.5 production release package, now available for download from http://www.nvidia.com/getcuda
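If it helps to verify which runtime a build actually picks up: the CUDA runtime encodes its version as major*1000 + minor*10, so the 6.5 release reports 6050. A minimal check, for illustration only (rt is just a local variable for this snippet):

    int rt = 0;
    cudaRuntimeGetVersion(&rt);
    printf("CUDA runtime: %d.%d\n", rt / 1000, (rt % 1000) / 10);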