I am having some trouble with my Django/Celery/PyCUDA setup. I am using PyCUDA for some image processing on an Amazon EC2 G2 instance. Here is the deviceQuery info for my CUDA-capable GRID K520 card:

Detected 1 CUDA Capable device(s)
Device 0: "GRID K520"
CUDA Driver Version / Runtime Version 6.0 / 6.0
CUDA Capability Major/Minor version number: 3.0
Total amount of global memory: 4096 MBytes (4294770688 bytes)
( 8) Multiprocessors, (192) CUDA Cores/MP: 1536 CUDA Cores
GPU Clock rate: 797 MHz (0.80 GHz)
Memory Clock rate: 2500 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 524288 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 0 / 3
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 6.0, CUDA Runtime Version = 6.0, NumDevs = 1, Device0 = GRID K520
Result = PASS
I am using a pretty out-of-the-box Celery config. I have a set of tasks defined in utils/tasks.py, which are tested and work fine before any PyCUDA code is involved. I installed PyCUDA via pip.
At the top of the file that I am having trouble with, I do my standard imports:
from celery import task
# other imports
import os
try:
    import Image
except ImportError:
    from PIL import Image
import time
#Cuda imports
import pycuda.autoinit
import pycuda.driver as cuda
from pycuda.compiler import SourceModule
import numpy
A remote server initiates a task, which follows this basic workflow:
@task()
def photo_function(photo_id, ...):
    print 'Got photo...'
    ... Do some stuff ...
    result = do_photo_manipulation(photo_id)
    return result
def do_photo_manipulation(photo_id):
    im = Image.open(inPath)
    px = numpy.array(im)
    px = px.astype(numpy.float32)
    d_px = cuda.mem_alloc(px.nbytes)
    ... (Do stuff with the pixel array) ...
    return new_image
This works if I run it in shell_plus (i.e., ./manage.py shell_plus) and if I run it as a standalone process outside of Django and Celery. It is only in the Celery worker context that it fails, with the error: cuMemAlloc failed: not initialized
I have looked at other solutions for a while, and tried moving the initializing import into the function itself. I have also added a wait() call, to rule out the possibility that the GPU simply isn't ready to do work.
Here is an answer suggesting the error comes from not importing pycuda.autoinit, which I have already done: http://comments.gmane.org/gmane.comp.python.cuda/1975
Any help here would be appreciated!
If I need to provide any more information, just let me know!
EDIT: Here is the test code:

def CudaImageShift(imageIn, mode="luminosity", log=0):
    if log == 1:
        print ("----------> CUDA CONVERSION")
        # print "ENVIRON: "
        # import os
        # print os.environ
    print 'AUTOINIT'
    print pycuda.autoinit
    print 'Making context...'
    context = make_default_context()
    print 'Context created.'
    totalT0 = time.time()
    print 'Doing test run...'
    a = numpy.random.randn(4, 4)
    a = a.astype(numpy.float32)
    print 'Test mem alloc'
    a_gpu = cuda.mem_alloc(a.nbytes)
    print 'MemAlloc complete, test mem copy'
    cuda.memcpy_htod(a_gpu, a)
    print 'memcopy complete'

And the worker log shows:

[2014-07-15 14:52:20,469: WARNING/Worker-1] cuDeviceGetCount failed: not initialized
I believe the problem you are experiencing is related to CUDA contexts. Since CUDA 4.0, a CUDA context is required per process and per device.
Behind the scenes, Celery spawns processes for the task workers. When a worker process starts, it does not have a context available. In PyCUDA, context creation happens in the autoinit module. That is why your code works if you run it as a standalone program (no extra process is created, so the context is valid), or if you put the import pycuda.autoinit inside the CUDA task (now the worker process has a context; I believe you tried that already).
If you want to avoid the import, you may be able to use make_default_context from pycuda.tools, although I'm not very familiar with PyCUDA and how it handles context management.
from pycuda.tools import make_default_context

@task()
def photo_function(photo_id, ...):
    ctx = make_default_context()
    try:
        print 'Got photo...'
        ... Do some stuff ...
        result = do_photo_manipulation(photo_id)
        return result
    finally:
        ctx.pop()  # release the context so repeated tasks don't leak contexts
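To see why each worker process needs its own initialization, here is a CUDA-free sketch of the same situation. The standard library's multiprocessing module stands in for Celery's prefork pool (an assumption: your workers use the default prefork pool on Linux), and a module-level PID stands in for the context that pycuda.autoinit creates at import time:

```python
import multiprocessing as mp
import os

# Module-level state created at import time, analogous to the CUDA
# context that pycuda.autoinit creates in the importing process.
_OWNER_PID = os.getpid()

def _report(queue):
    # Runs in the forked worker process.
    queue.put((_OWNER_PID, os.getpid()))

def state_owned_by_worker():
    """Fork a child (as Celery's prefork pool does) and check whether
    the module-level state was created by that child."""
    ctx = mp.get_context('fork')
    queue = ctx.Queue()
    child = ctx.Process(target=_report, args=(queue,))
    child.start()
    owner_pid, child_pid = queue.get()
    child.join()
    return owner_pid == child_pid

if __name__ == '__main__':
    # The forked worker inherits state created by the parent -- just as
    # a CUDA context is only usable in the process that created it.
    print(state_owned_by_worker())  # False
```

The inherited state was created by the parent, so the check prints False; in the CUDA case the situation is even stricter, because a context inherited across fork is not usable at all in the child.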
Beware that context creation is an expensive operation. CUDA deliberately front-loads a lot of work into context creation in order to avoid unexpected delays later on. That is why you have a stack of contexts that you can push/pop between host threads (but not between processes). If your kernel code is very fast, you may see noticeable delays from the context create/destroy procedure.
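If the per-task create/destroy overhead matters, one option is to create the context once per worker process at startup instead. This is a sketch, untested here: it assumes Celery's worker_process_init signal (which fires once in each freshly started worker process) and the standard PyCUDA driver API; adapt the device index to your setup.

```python
import pycuda.driver as cuda
from celery.signals import worker_process_init

_context = None  # one CUDA context kept alive per worker process

@worker_process_init.connect
def init_cuda(**kwargs):
    """Runs once in each newly spawned worker process."""
    global _context
    cuda.init()                              # initialize the driver API in this process
    _context = cuda.Device(0).make_context() # create and push a context on device 0
```

Tasks running in that worker can then use the already-current context, and the expensive creation cost is paid once per worker rather than once per task.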