CUDA malloc, mmap/mremap

CUDA device memory can be allocated using cudaMalloc/cudaFree, sure. This is fine, but primitive.

I'm curious to know, is device memory virtualised in some way? Are there equivalent operations to mmap, and more importantly, mremap for device memory? If device memory is virtualised, I expect these sorts of functions should exist. It seems modern GPU drivers implement paging when there is contention for limited video resources by multiple processes, which suggests it's virtualised in some way or another...

Does anyone know where I can read more about this?

Edit:
Okay, my question was a bit general. I've read the bits of the manual that talk about mapping system memory for device access. I was more interested in device-allocated memory however.

Specific questions:
- Is there any possible way to remap device memory? (ie, to grow a device allocation)
- Is it possible to map device allocated memory to system memory?
- Is there some performance hazard using mapped pinned memory? Is the memory duplicated on the device as needed, or will it always fetch the memory across the pci-e bus?

I have cases where the memory is used by the GPU 99% of the time; so it should be device-local, but it may be convenient to map device memory to system memory for occasional structured read-back without having to implement an awkward deep-copy.

Yes, unified memory exists, however I'm happy with explicit allocation, save for the odd moment when I want a sneaky read-back.

I've found the manual fairly light on detail in general.

Solution

CUDA comes with a fine CUDA C Programming Guide as it's main manual which has sections on Mapped Memory as well as Unified Memory Programming.

Responding to your additional posted questions, and following your cue to leave UM out of the consideration:

Is there any possible way to remap device memory? (ie, to grow a device allocation)

There is no direct method. You would have to manually create a new allocation of the desired size, and copy the old data to it, then free the old allocation. If you expect to do this a lot, and don't mind the significant overhead associated with it, you could take a look at thrust device vectors which will hide some of the manual labor and allow you to resize an allocation in a single vector-style .resize() operation. There's no magic, however, so thrust is just a template library built on top of CUDA C (for the CUDA device backend) and so it is going to do a sequence of cudaMalloc and cudaFree operations, just as you would "manually".
Is it possible to map device allocated memory to system memory?

Leaving aside UM, no. Device memory cannot be mapped into host address space.
Is there some performance hazard using mapped pinned memory? Is the memory duplicated on the device as needed, or will it always fetch the memory across the pci-e bus?

no, host mapped data is never duplicated in device memory, and apart from L2 caching, mapped data needed by the GPU will always be fetched across the PCI-E bus