Concurrent: Short copy, Long kernel

When running concurrent copy & kernel operations:
If I have a kernel runTime that is twice as long as a dataCopy operation, will I get 2 copies per kernel run?
The stream examples I'm seeing show a 1:1 relationship (time of copy = time of kernel run). I'm wondering what happens when the two times differ. Is there always at most one copy operation for every kernel launch? Or does the copy operation run independently of the kernel launch? In other words, could I complete 5 copy operations for every kernel launch, if the run and copy times work out that way?
(I'm trying to figure out how many copy operations to queue up before a kernel launch.)

One to one: (time to copy = kernel run time)
<--stream1Copy--><--stream2Copy-->
..............................<-stream1Kernel->

Two to one: (time to copy = 1/2 kernel run time)
<-stream1Copy-><-stream2Copy-><-stream3Copy->
............................<----------stream1Kernel------------>

Solution

You can have more than one copy per kernel launch. Only one copy (per direction, on devices with dual copy engines) can be running at a particular time to a particular GPU, but once that one is complete, another can be started immediately. Asynchronous copies issued in streams other than the kernel launch stream in question will run completely asynchronously to that kernel launch, assuming neither stream is stream 0. (This also assumes you are using pinned memory, i.e. cudaHostAlloc, to create the relevant host-side buffers.)
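As a rough illustration of the point above, here is a minimal sketch that launches one long kernel in one stream and queues three asynchronous copies in a second stream. The kernel `longKernel`, the buffer names, the chunk size, and the iteration count are all made up for the example, and whether the copies actually overlap the kernel depends on your device having at least one copy engine free and on the relative timings.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Bail out of main() with a message if a runtime call fails.
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            std::fprintf(stderr, "%s failed: %s\n", #call,           \
                         cudaGetErrorString(err_));                  \
            return 1;                                                \
        }                                                            \
    } while (0)

// Hypothetical kernel: a deliberately long loop so that it outlasts one copy.
__global__ void longKernel(float *data, size_t n, int iters)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < iters; ++k) v = v * 1.0001f + 0.5f;
        data[i] = v;
    }
}

int main()
{
    const size_t n = 1 << 22;               // elements per chunk (assumed size)
    const size_t bytes = n * sizeof(float);
    const int numCopies = 3;                // copies queued alongside one kernel

    float *hostBuf[numCopies];
    float *devBuf[numCopies];
    float *devWork = nullptr;

    for (int c = 0; c < numCopies; ++c) {
        // Pinned host memory is required for the copies to be truly asynchronous.
        CHECK(cudaHostAlloc((void **)&hostBuf[c], bytes, cudaHostAllocDefault));
        CHECK(cudaMalloc((void **)&devBuf[c], bytes));
        for (size_t i = 0; i < n; ++i) hostBuf[c][i] = 1.0f;
    }
    CHECK(cudaMalloc((void **)&devWork, bytes));
    CHECK(cudaMemset(devWork, 0, bytes));

    // Two non-default streams: one for the kernel, one for the copies.
    cudaStream_t kernelStream, copyStream;
    CHECK(cudaStreamCreate(&kernelStream));
    CHECK(cudaStreamCreate(&copyStream));

    // Launch the long kernel first; the launch returns to the host immediately.
    const unsigned blocks = (unsigned)((n + 255) / 256);
    longKernel<<<blocks, 256, 0, kernelStream>>>(devWork, n, 2000);
    CHECK(cudaGetLastError());

    // Queue several copies in the other stream. They run back to back,
    // one at a time, without waiting for the kernel to finish.
    for (int c = 0; c < numCopies; ++c) {
        CHECK(cudaMemcpyAsync(devBuf[c], hostBuf[c], bytes,
                              cudaMemcpyHostToDevice, copyStream));
    }

    CHECK(cudaDeviceSynchronize());
    std::printf("kernel and %d copies completed\n", numCopies);

    for (int c = 0; c < numCopies; ++c) {
        CHECK(cudaFree(devBuf[c]));
        CHECK(cudaFreeHost(hostBuf[c]));
    }
    CHECK(cudaFree(devWork));
    CHECK(cudaStreamDestroy(kernelStream));
    CHECK(cudaStreamDestroy(copyStream));
    return 0;
}
```

If a later kernel needs the copied data, you would still have to order it after the relevant copy, for example with cudaEventRecord and cudaStreamWaitEvent; the sketch leaves that out to keep the overlap itself visible.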

You may want to read the section on asynchronous and overlapping transfers with computation in the CUDA best practices guide.

The reason you frequently see a 1:1 analysis of compute and copy is that the copied data is assumed to be consumed by (or produced by) the kernel call, so it is natural to think of each block of data as paired with one launch. But if it's easier to structure your code as a sequence of copies, there should be no problem with that. Naturally, if you can batch all your data into a single cudaMemcpy call, that will be slightly more efficient than a sequence of copies transferring the same data, since each call carries some fixed overhead.
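If you want to check the batching claim on your own hardware, something like the following sketch could be used. It times the same pinned buffer transferred as several chunks and then as one copy, using CUDA events. The chunk count and sizes are arbitrary assumptions, error checking is reduced to the allocations for brevity, and the numbers you get will vary by system.

```cuda
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

int main()
{
    const int chunks = 8;                       // assumed number of small copies
    const size_t chunkBytes = 1 << 20;          // assumed 1 MiB per chunk
    const size_t totalBytes = chunks * chunkBytes;

    char *hostBuf = nullptr;
    char *devBuf = nullptr;
    if (cudaHostAlloc((void **)&hostBuf, totalBytes, cudaHostAllocDefault) != cudaSuccess ||
        cudaMalloc((void **)&devBuf, totalBytes) != cudaSuccess) {
        std::fprintf(stderr, "allocation failed\n");
        return 1;
    }
    std::memset(hostBuf, 1, totalBytes);

    cudaStream_t stream;
    cudaEvent_t start, mid, stop;
    cudaStreamCreate(&stream);
    cudaEventCreate(&start);
    cudaEventCreate(&mid);
    cudaEventCreate(&stop);

    // Warm-up transfer so one-time setup cost is not charged to either case.
    cudaMemcpyAsync(devBuf, hostBuf, totalBytes, cudaMemcpyHostToDevice, stream);

    // Case 1: the data as a sequence of smaller copies.
    cudaEventRecord(start, stream);
    for (int c = 0; c < chunks; ++c) {
        cudaMemcpyAsync(devBuf + c * chunkBytes, hostBuf + c * chunkBytes,
                        chunkBytes, cudaMemcpyHostToDevice, stream);
    }
    cudaEventRecord(mid, stream);

    // Case 2: the same data as a single copy.
    cudaMemcpyAsync(devBuf, hostBuf, totalBytes, cudaMemcpyHostToDevice, stream);
    cudaEventRecord(stop, stream);
    cudaEventSynchronize(stop);

    float chunkedMs = 0.0f, singleMs = 0.0f;
    cudaEventElapsedTime(&chunkedMs, start, mid);
    cudaEventElapsedTime(&singleMs, mid, stop);
    std::printf("%d chunked copies: %.3f ms, single copy: %.3f ms\n",
                chunks, chunkedMs, singleMs);

    cudaEventDestroy(start);
    cudaEventDestroy(mid);
    cudaEventDestroy(stop);
    cudaStreamDestroy(stream);
    cudaFree(devBuf);
    cudaFreeHost(hostBuf);
    return 0;
}
```

The same event-timing approach can be pointed at a kernel launch to get a rough copy-to-kernel time ratio, which is one way to decide how many copies to queue per launch.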

The visual profiler will help you see exactly what is going on comparing data copy operations to kernel operations, in a timeline fashion.