CUDA 5.5 samples compile fine on OS X 10.9 but error out immediately when run

This is on a MacBookPro7,1 with a GeForce 320M (compute capability 1.2). Previously, with OS X 10.7.8, XCode 4.x and CUDA 5.0, CUDA code compiled and ran fine.

Then, I update to OS X 10.9.2, XCode 5.1 and CUDA 5.5. At first, deviceQuery failed. I read elsewhere that 5.5.28 (the driver CUDA 5.5 shipped with) did not support compute capability 1.x (sm_10), but that 5.5.43 did. After updating the CUDA driver to the even more current 5.5.47 (GPU Driver verions 8.24.11 310.90.9b01), deviceQuery indeed passes with the following output.

./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce 320M"
  CUDA Driver Version / Runtime Version          5.5 / 5.5
  CUDA Capability Major/Minor version number:    1.2
  Total amount of global memory:                 253 MBytes (265027584 bytes)
  ( 6) Multiprocessors, (  8) CUDA Cores/MP:     48 CUDA Cores
  GPU Clock rate:                                950 MHz (0.95 GHz)
  Memory Clock rate:                             1064 Mhz
  Memory Bus Width:                              128-bit
  Maximum Texture Dimension Size (x,y,z)         1D=(8192), 2D=(65536, 32768), 3D=(2048, 2048, 2048)
  Maximum Layered 1D Texture Size, (num) layers  1D=(8192), 512 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(8192, 8192), 512 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 16384
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1024
  Maximum number of threads per block:           512
  Max dimension size of a thread block (x,y,z): (512, 512, 64)
  Max dimension size of a grid size    (x,y,z): (65535, 65535, 1)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             256 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            Yes
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      No
  Device PCI Bus ID / PCI location ID:           0 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 5.5, CUDA Runtime Version = 5.5, NumDevs = 1, Device0 = GeForce 320M
Result = PASS

Furthermore, I can successfully compile without modification the CUDA 5.5 samples, though I have not tried to compile all of them.

However, samples such as matrixMul, simpleCUFFT, simpleCUBLAS all fail immediately when run.

$ ./matrixMul 
[Matrix Multiply Using CUDA] - Starting...
GPU Device 0: "GeForce 320M" with compute capability 1.2

MatrixA(160,160), MatrixB(320,160)
cudaMalloc d_A returned error code 2, line(164)

$ ./simpleCUFFT 
[simpleCUFFT] is starting...
GPU Device 0: "GeForce 320M" with compute capability 1.2

CUDA error at simpleCUFFT.cu:105 code=2(cudaErrorMemoryAllocation) "cudaMalloc((void **)&d_signal, mem_size)"

Error Code 2 is cudaErrorMemoryAllocation, but I suspect it hides a failed CUDA initialization somehow.

$ ./simpleCUBLAS 
GPU Device 0: "GeForce 320M" with compute capability 1.2

simpleCUBLAS test running..
!!!! CUBLAS initialization error

Actual error code is CUBLAS_STATUS_NOT_INITIALIZED being returned from call to cublasCreate().

Has anyone run into this before and found a fix? Thanks in advance.

Solution

I would guess you are running out of memory. Your GPU is being used by the display manager, and it only has 256Mb of RAM. The combined memory footprint of the OS 10.9 display manager and the CUDA 5.5 runtime might be leaving you with almost no free memory. I would recommend writing and running a small test program like this:

#include <iostream>

int main(void)
{
    size_t mfree, mtotal;

    cudaSetDevice(0);
    cudaMemGetInfo(&mfree, &mtotal);

    std::cout << mfree << " bytes of " << mtotal << " available." << std::endl;

    return cudaDeviceReset();
}

[disclaimer: written in browser, never compiled or tested use at own risk ]

That should give you a picture of the available free memory after context establishment on the device. You might be surprised at how little there is to work with.

EDIT: Here is an even lighter weight alternative test which doesn't even attempt to establish a context on the device. Instead, it only uses the driver API to check the device. If this succeeds, then either the runtime API shipping for OS X is broken somehow, or you have no memory available on the device for establishing a context. If it fails, then your truly have a broken CUDA installation. Either way, I would consider opening a bug report with NVIDIA:

#include <iostream>
#include <cuda.h>

int main(void)
{
    CUdevice d;
    size_t b;
    cuInit(0);
    cuDeviceGet(&d, 0);
    cuDeviceTotalMem(&b, d);

    std::cout << "Total memory = " << b << std::endl;

    return 0;
}

Note you will need to explicitly link the cuda driver library to get this to work (pass -lcuda to nvcc, for example)