
How to estimate GPU memory requirements for thrust based implementation?


I have 3 different thrust-based implementations that perform certain calculations: the first is the slowest and requires the least GPU memory, the second is the fastest and requires the most GPU memory, and the third one is in between. For each of them I know the size and data type of every device vector used, so I am using vector.size()*sizeof(type) to roughly estimate the memory needed for storage.
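
For illustration, this is the kind of bookkeeping I mean (the vectors, types, and sizes below are just placeholders, not my actual implementations):

    #include <thrust/device_vector.h>
    #include <cstddef>
    #include <iostream>

    int main()
    {
        const std::size_t n = 1 << 20;                 // example input length

        // Placeholder vectors standing in for the ones in my implementations.
        thrust::device_vector<float> keys(n);
        thrust::device_vector<int>   values(n);
        thrust::device_vector<float> output(n);

        // Rough storage estimate: sum of size() * sizeof(element type).
        std::size_t bytes = keys.size()   * sizeof(float)
                          + values.size() * sizeof(int)
                          + output.size() * sizeof(float);

        std::cout << "estimated storage: " << bytes << " bytes\n";
        return 0;
    }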

So for a given input, based on its size, I would like to decide which implementation to use. In other words, I want to determine the fastest implementation that will still fit in the available GPU memory.

I think that for the very long vectors I am dealing with, the storage size I am calculating this way is a fairly good estimate, and the rest of the overhead (if any) can be disregarded.

But how would I estimate the memory usage overhead (if any) associated with the implementation of the thrust algorithms? Specifically, I am looking for such estimates for transform, copy, reduce, reduce_by_key, and gather. I do not really care about overhead that is static and is not a function of the sizes of the algorithm's input and output parameters, unless it is very significant.

I understand the implications of GPU memory fragmentation, etc., but let's leave that aside for a moment.

Thank you very much for taking the time to look into this.


Solution

  • Thrust is intended to be used like a black box and there is no documentation of the memory overheads of the various algorithms that I am aware of. But it doesn't sound like a very difficult problem to deduce empirically by running a few numerical experiments. You might expect the memory consumption of a particular algorithm to be approximable as:

    total number of words of memory consumed = a + (1 + b)*N
    

    for a problem with N input words. Here a will be the fixed overhead of the algorithm and 1+b the slope of the best-fit memory-versus-N line; b is then the per-input-word overhead of the algorithm.
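
    As a sketch of what the fit itself looks like, two measurements at different N are already enough to solve for a and b (the measured numbers below are hypothetical):

    #include <iostream>

    int main()
    {
        // Hypothetical measurements: total words of memory observed
        // for the same algorithm at two different input sizes.
        double N1 = 1e6, M1 = 2.1e6;   // words consumed for N1 input words
        double N2 = 4e6, M2 = 8.1e6;   // words consumed for N2 input words

        // Solve M = a + (1 + b)*N for the two unknowns.
        double slope = (M2 - M1) / (N2 - N1);   // = 1 + b
        double b = slope - 1.0;                 // overhead per input word
        double a = M1 - slope * N1;             // fixed overhead in words

        std::cout << "a = " << a << " words, b = " << b << "\n";
        return 0;
    }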

    So the question then becomes how to monitor the memory usage of a given algorithm. Thrust uses an internal helper function, get_temporary_buffer, to allocate scratch memory. The best idea would be to write your own implementation of get_temporary_buffer which emits the size it has been called with, and (perhaps) uses a call to cudaMemGetInfo to capture the context memory statistics at the time the function is called. You can see some concrete examples of how to intercept get_temporary_buffer calls here.
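
    One way to get that instrumentation without touching Thrust itself is to pass a logging allocator through the execution policy (supported in reasonably recent Thrust versions); the allocator below is only an illustrative sketch, not the only way to do it:

    #include <thrust/device_vector.h>
    #include <thrust/sort.h>
    #include <thrust/system/cuda/execution_policy.h>
    #include <cuda_runtime.h>
    #include <cstddef>
    #include <cstdio>

    // Allocator that reports every temporary allocation Thrust asks for.
    struct logging_allocator
    {
        typedef char value_type;

        char *allocate(std::ptrdiff_t num_bytes)
        {
            std::size_t free_bytes = 0, total_bytes = 0;
            cudaMemGetInfo(&free_bytes, &total_bytes);
            std::printf("thrust requested %td bytes (%zu bytes currently free)\n",
                        num_bytes, free_bytes);

            char *ptr = 0;
            cudaMalloc(reinterpret_cast<void **>(&ptr), num_bytes);
            return ptr;
        }

        void deallocate(char *ptr, std::size_t)
        {
            cudaFree(ptr);
        }
    };

    int main()
    {
        thrust::device_vector<int> vec(1 << 22, 1);

        logging_allocator alloc;

        // Route Thrust's internal temporary allocations through the allocator.
        thrust::sort(thrust::cuda::par(alloc), vec.begin(), vec.end());
        return 0;
    }

    Running each of your three implementations with such an allocator at a few different input sizes gives you the (N, memory) pairs to feed into the fit above.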

    With a suitably instrumented allocator and a few runs at different problem sizes, you should be able to fit the model above and estimate the b value for a given algorithm. The model can then be used in your code to determine safe maximum problem sizes for a given amount of memory.
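
    Once a and b are in hand, something along these lines (the fitted coefficients here are placeholders) can turn the currently free device memory into a safe maximum N:

    #include <cuda_runtime.h>
    #include <cstddef>
    #include <iostream>

    // Largest input size N (in words) for which a + (1 + b)*N still fits
    // into the currently free device memory, with a 10% safety margin.
    std::size_t max_problem_size(double a_words, double b, std::size_t word_size)
    {
        std::size_t free_bytes = 0, total_bytes = 0;
        cudaMemGetInfo(&free_bytes, &total_bytes);

        double budget_words = 0.9 * free_bytes / word_size - a_words;
        if (budget_words <= 0.0)
            return 0;
        return static_cast<std::size_t>(budget_words / (1.0 + b));
    }

    int main()
    {
        // Hypothetical fitted coefficients for the fastest implementation.
        std::size_t n_max = max_problem_size(1e5, 2.0, sizeof(float));
        std::cout << "fastest implementation is safe up to N = " << n_max << "\n";
        return 0;
    }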

    I hope this is what you were asking about...