This is on a MacBookPro7,1 with a GeForce 320M (compute capability 1.2). Previously, with OS X 10.7.8, Xcode 4.x and CUDA 5.0, CUDA code compiled and ran fine.
Then I updated to OS X 10.9.2, Xcode 5.1 and CUDA 5.5. At first, deviceQuery failed. I read elsewhere that 5.5.28 (the driver that shipped with CUDA 5.5) did not support compute capability 1.x (sm_10), but that 5.5.43 did. After updating the CUDA driver to the even more recent 5.5.47 (GPU driver version 8.24.11 310.90.9b01), deviceQuery does pass, with the following output:
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce 320M"
CUDA Driver Version / Runtime Version 5.5 / 5.5
CUDA Capability Major/Minor version number: 1.2
Total amount of global memory: 253 MBytes (265027584 bytes)
( 6) Multiprocessors, ( 8) CUDA Cores/MP: 48 CUDA Cores
GPU Clock rate: 950 MHz (0.95 GHz)
Memory Clock rate: 1064 Mhz
Memory Bus Width: 128-bit
Maximum Texture Dimension Size (x,y,z) 1D=(8192), 2D=(65536, 32768), 3D=(2048, 2048, 2048)
Maximum Layered 1D Texture Size, (num) layers 1D=(8192), 512 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(8192, 8192), 512 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 16384
Warp size: 32
Maximum number of threads per multiprocessor: 1024
Maximum number of threads per block: 512
Max dimension size of a thread block (x,y,z): (512, 512, 64)
Max dimension size of a grid size (x,y,z): (65535, 65535, 1)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: Yes
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): No
Device PCI Bus ID / PCI location ID: 0 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 5.5, CUDA Runtime Version = 5.5, NumDevs = 1, Device0 = GeForce 320M
Result = PASS
I can also compile the CUDA 5.5 samples without modification, though I have not tried to compile all of them.
However, samples such as matrixMul, simpleCUFFT and simpleCUBLAS all fail immediately when run.
$ ./matrixMul
[Matrix Multiply Using CUDA] - Starting...
GPU Device 0: "GeForce 320M" with compute capability 1.2
MatrixA(160,160), MatrixB(320,160)
cudaMalloc d_A returned error code 2, line(164)
$ ./simpleCUFFT
[simpleCUFFT] is starting...
GPU Device 0: "GeForce 320M" with compute capability 1.2
CUDA error at simpleCUFFT.cu:105 code=2(cudaErrorMemoryAllocation) "cudaMalloc((void **)&d_signal, mem_size)"
Error code 2 is cudaErrorMemoryAllocation, but I suspect it somehow hides a failed CUDA initialization.
$ ./simpleCUBLAS
GPU Device 0: "GeForce 320M" with compute capability 1.2
simpleCUBLAS test running..
!!!! CUBLAS initialization error
The actual error code is CUBLAS_STATUS_NOT_INITIALIZED, returned from the call to cublasCreate().
Has anyone run into this before and found a fix? Thanks in advance.
I would guess you are running out of memory. Your GPU is also being used by the display manager, and it only has 256 MB of RAM. The combined memory footprint of the OS X 10.9 display manager and the CUDA 5.5 runtime might be leaving you with almost no free memory. I would recommend writing and running a small test program like this:
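#include <iostream>
#include <cuda_runtime.h>

int main(void)
{
    size_t mfree = 0, mtotal = 0;
    cudaSetDevice(0);
    // Report the error rather than printing meaningless values if the query fails.
    cudaError_t err = cudaMemGetInfo(&mfree, &mtotal);
    if (err != cudaSuccess) {
        std::cerr << "cudaMemGetInfo failed: " << cudaGetErrorString(err) << std::endl;
        return 1;
    }
    std::cout << mfree << " bytes of " << mtotal << " available." << std::endl;
    return cudaDeviceReset();
}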
[disclaimer: written in browser, never compiled or tested; use at own risk]
That should give you a picture of the available free memory after context establishment on the device. You might be surprised at how little there is to work with.
EDIT: Here is an even lighter-weight alternative test that doesn't attempt to establish a context on the device. Instead, it uses only the driver API to query the device. If this succeeds, then either the runtime API shipping for OS X is broken somehow, or you have no memory available on the device for establishing a context. If it fails, then you truly have a broken CUDA installation. Either way, I would consider opening a bug report with NVIDIA:
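#include <iostream>
#include <cuda.h>

int main(void)
{
    CUdevice d;
    size_t b = 0;
    // Driver API only: no context is created on the device.
    // Check every call so that a failure is actually visible.
    if (cuInit(0) != CUDA_SUCCESS ||
        cuDeviceGet(&d, 0) != CUDA_SUCCESS ||
        cuDeviceTotalMem(&b, d) != CUDA_SUCCESS) {
        std::cerr << "Driver API call failed" << std::endl;
        return 1;
    }
    std::cout << "Total memory = " << b << std::endl;
    return 0;
}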
Note that you will need to explicitly link the CUDA driver library to get this to work (pass -lcuda to nvcc, for example).