Search code examples
openclbuffer-objects

OpenCL Buffer Creation


I am fairly new to OpenCL and though I have understood everything up until now, but I am having trouble understanding how buffer objects work.

I haven't understood where a buffer object is stored. In this StackOverflow question it is stated that:

If you have one device only, probably (99.99%) is going to be in the device. (In rare cases it may be in the host if the device does not have enough memory for the time being)

To me, this means that buffer objects are stored in device memory. However, as is stated in this StackOverflow question, if the flag CL_MEM_ALLOC_HOST_PTR is used in clCreateBuffer, the memory used will most likely be pinned memory. My understanding is that, when memory is pinned it will not be swapped out. This means that pinned memory MUST be located in RAM, not in device memory.

So what is actually happening?

What I would like to know what do the flags:

  • CL_MEM_USE_HOST_PTR
  • CL_MEM_COPY_HOST_PTR
  • CL_MEM_ALLOC_HOST_PTR

imply about the location of buffer.

Thank you


Solution

  • The specification is (deliberately?) vague on the topic, leaving a lot of freedom to implementors. So unless an OpenCL implementation you are targeting makes explicit guarantees for the flags, you should treat them as advisory.

    First off, CL_MEM_COPY_HOST_PTR actually has nothing to do with allocation, it just means that you would like clCreateBuffer to pre-fill the allocated memory with the contents of the memory at the host_ptr you passed to the call. This is as if you called clCreateBuffer with host_ptr = NULL and without this flag, and then made a blocking clEnqueueWriteBuffer call to write the entire buffer.

    Regarding allocation modes:

    • CL_MEM_USE_HOST_PTR - this means you've pre-allocated some memory, correctly aligned, and would like to use this as backing memory for the buffer. The implementation can still allocate device memory and copy back and forth between your buffer and the allocated memory, if the device does not support directly accessing host memory, or if the driver decides that a shadow copy to VRAM will be more efficient than directly accessing system memory. On implementations that can read directly from system memory though, this is one option for zero-copy buffers.
    • CL_MEM_ALLOC_HOST_PTR - This is a hint to tell the OpenCL implementation that you're planning to access the buffer from the host side by mapping it into host address space, but unlike CL_MEM_USE_HOST_PTR, you are leaving the allocation itself to the OpenCL implementation. For implementations that support it, this is another option for zero copy buffers: create the buffer, map it to the host, get a host algorithm or I/O to write to the mapped memory, then unmap it and use it in a GPU kernel. Unlike CL_MEM_USE_HOST_PTR, this leaves the door open for using VRAM that can be mapped directly to the CPU's address space (e.g. PCIe BARs).
    • Default (neither of the above 2): Allocate wherever most convenient for the device. Typically VRAM, and if memory-mapping into host memory is not supported by the device, this typically means that if you map it into host address space, you end up with 2 copies of the buffer, one in VRAM and one in system memory, while the OpenCL implementation internally copies back and forth between the 2.

    Note that the implementation may also use any access flags provided ( CL_MEM_HOST_WRITE_ONLY, CL_MEM_HOST_READ_ONLY, CL_MEM_HOST_NO_ACCESS, CL_MEM_WRITE_ONLY, CL_MEM_READ_ONLY, and CL_MEM_READ_WRITE) to influence the decision where to allocate memory.

    Finally, regarding "pinned" memory: many modern systems have an IOMMU, and when this is active, system memory access from devices can cause IOMMU page faults, so the host memory technically doesn't even need to be resident. In any case, the OpenCL implementation is typically deeply integrated with a kernel-level device driver, which can typically pin system memory ranges (exclude them from paging) on demand. So if using CL_MEM_USE_HOST_PTR you just need to make sure you provide appropriately aligned memory, and the implementation will take care of pinning for you.