I have 3 different Thrust-based implementations that perform certain calculations: the first is the slowest and requires the least GPU memory, the second is the fastest and requires the most GPU memory, and the third is in between. For each of them I know the size and data type of every device vector used, so I am using vector.size() * sizeof(type) to roughly estimate the memory needed for storage.
So for a given input, based on its size, I would like to decide which implementation to use; in other words, to determine the fastest implementation that will fit in the available GPU memory.
I think that for the very long vectors I am dealing with, the storage size behind vector.data() that I am calculating is a fairly good estimate, and the rest of the overhead (if any) can be disregarded.
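To make this concrete, the selection logic I have in mind looks roughly like the sketch below, using cudaMemGetInfo to query free device memory; the per-implementation byte estimates and the safety margin are just placeholders, not numbers from my actual code:

    #include <cuda_runtime.h>
    #include <cstddef>

    // Hypothetical per-implementation footprint estimates, built from the
    // sum of vector.size() * sizeof(type) over the device vectors each
    // implementation allocates. The multipliers here are placeholders.
    std::size_t impl1_bytes(std::size_t n) { return 2 * n * sizeof(float); } // slowest, smallest
    std::size_t impl2_bytes(std::size_t n) { return 6 * n * sizeof(float); } // fastest, largest
    std::size_t impl3_bytes(std::size_t n) { return 4 * n * sizeof(float); } // in between

    // Pick the fastest implementation whose estimated footprint fits in
    // the currently free device memory, keeping a small safety margin.
    int choose_implementation(std::size_t n)
    {
        std::size_t free_bytes = 0, total_bytes = 0;
        cudaMemGetInfo(&free_bytes, &total_bytes);
        const std::size_t budget = free_bytes - free_bytes / 10; // ~10% margin

        if (impl2_bytes(n) <= budget) return 2; // fastest
        if (impl3_bytes(n) <= budget) return 3;
        return 1;                               // slowest fallback
    }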
But how would I estimate the memory usage overhead (if any) associated with the Thrust algorithm implementations? Specifically, I am looking for such estimates for transform, copy, reduce, reduce_by_key, and gather. I do not really care about overhead that is static and is not a function of the sizes of the algorithm's input and output parameters, unless it is very significant.
I understand the implications of GPU memory fragmentation, etc., but let's leave that aside for the moment.
Thank you very much for taking the time to look into this.
Thrust is intended to be used like a black box, and there is no documentation of the memory overheads of the various algorithms that I am aware of. But it doesn't sound like a very difficult problem to deduce the overheads empirically by running a few numerical experiments. You might expect the memory consumption of a particular algorithm to be approximable as:
total number of words of memory consumed = a + (1 + b) * N

for a problem with N input words. Here a is the fixed overhead of the algorithm and (1 + b) is the slope of the best-fit line of memory consumption versus N. b is then the amount of overhead the algorithm adds per input word.
So the question then becomes how to monitor the memory usage of a given algorithm. Thrust uses an internal helper function, get_temporary_buffer, to allocate internal memory. The best idea would be to write your own implementation of get_temporary_buffer which emits the size it has been called with, and (perhaps) uses a call to cudaMemGetInfo to get context memory statistics at the time the function is called. You can see some concrete examples of how to intercept get_temporary_buffer calls here.
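As a concrete illustration, with reasonably recent Thrust versions an equivalent way to intercept those temporary allocations is to pass a custom allocator through the thrust::cuda::par execution policy rather than replacing get_temporary_buffer directly; the logging allocator below is my own sketch, not part of Thrust:

    #include <thrust/device_vector.h>
    #include <thrust/execution_policy.h>
    #include <thrust/reduce.h>
    #include <cuda_runtime.h>
    #include <cstddef>
    #include <iostream>

    // Allocator that logs every temporary allocation Thrust requests.
    // All scratch space for the algorithm passes through allocate().
    struct logging_allocator
    {
        typedef char value_type;

        char* allocate(std::ptrdiff_t num_bytes)
        {
            std::cout << "thrust temporary allocation: " << num_bytes << " bytes\n";
            char* ptr = 0;
            cudaMalloc(&ptr, num_bytes);
            return ptr;
        }

        void deallocate(char* ptr, std::size_t /*num_bytes*/)
        {
            cudaFree(ptr);
        }
    };

    int main()
    {
        thrust::device_vector<int> d(1 << 24, 1);

        // Route the algorithm's temporary allocations through the
        // logging allocator via the execution policy.
        logging_allocator alloc;
        int sum = thrust::reduce(thrust::cuda::par(alloc), d.begin(), d.end());

        std::cout << "sum = " << sum << std::endl;
        return 0;
    }

Running something like this at a few problem sizes prints the temporary allocation sizes for the algorithm under test.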
With a suitably instrumented allocator and some runs at a few different problem sizes, you should be able to fit the model above and estimate b for a given algorithm. The model can then be used in your code to determine safe maximum problem sizes for a given amount of memory.
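For completeness, a minimal sketch of that fitting and sizing step, assuming you have collected (N, total words consumed) pairs from the instrumented runs; it is an ordinary least-squares line fit, and the function names are mine:

    #include <cstddef>
    #include <utility>
    #include <vector>

    // Fit total_words = a + (1 + b) * N by least squares from measured
    // (N, total_words) pairs collected with the instrumented allocator.
    std::pair<double, double> fit_overhead_model(const std::vector<std::pair<double, double> >& samples)
    {
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        const double n = static_cast<double>(samples.size());
        for (std::size_t i = 0; i < samples.size(); ++i)
        {
            sx  += samples[i].first;
            sy  += samples[i].second;
            sxx += samples[i].first * samples[i].first;
            sxy += samples[i].first * samples[i].second;
        }
        const double slope     = (n * sxy - sx * sy) / (n * sxx - sx * sx); // = 1 + b
        const double intercept = (sy - slope * sx) / n;                     // = a
        return std::make_pair(intercept, slope - 1.0);                      // (a, b)
    }

    // Largest N whose predicted footprint fits in the given budget (in words).
    std::size_t max_problem_size(double a, double b, std::size_t budget_words)
    {
        if (budget_words <= a) return 0;
        return static_cast<std::size_t>((budget_words - a) / (1.0 + b));
    }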
I hope this is what you were asking about...