Summary
I'd like some clarification on how the thrust::device_vector works.
AFAIK, writing to an indexed location such as device_vector[i] = 7 is implemented by the host, and therefore causes a call to memcpy. Does device_vector.push_back(7) also call memcpy?
Background
I'm working on a project comparing stock prices. The prices are stored in two vectors. I iterate over the two vectors, and when there's a change in their prices relative to each other, I write that change into a new vector. So I never know in advance how long the resulting vector is going to be. On the CPU the natural way to do this is with push_back, but I don't want to use push_back on the GPU vector if it's going to call memcpy every time.
Is there a more efficient way to build a vector piece by piece on the GPU?
Research
I've looked at this question, but it (and others like it) focuses on the most efficient way to access elements from the host. I want to build up a vector on the GPU.
Thank you.
Does device_vector.push_back(7) also call memcpy?
No. It does, however, result in a kernel launch per call.
Is there a more efficient way to build a vector piece by piece on the GPU?
Yes.
Build it (or large segments of it) in host memory first, then copy or insert to memory on the device in a single operation. You will greatly reduce latency and increase PCI-e bus utilization by doing so.