Is there a way to get CUDA's nvprof to include function calls like malloc in its statistical profiler?
I've been trying to improve the performance of my application. Naturally, I've been using nvprof as a tool in that effort.
Recently, in an effort to reduce the GPU memory footprint of my application, I wrote code that made it take twice as long to run. However, the new code that caused the slow-down was barely showing up in the profiler: the instruction sampling indicated that about 10% of the time was being spent in the new code, while a naive estimate would put it at about 50%. Maybe the new code caused more cache thrashing, maybe putting the implementation in a header file so it could be inlined confused the profiler, etc. However, for no good reason, I suspected the new code's calls to malloc.
Indeed, after I reduced the number of malloc calls, my performance increased, almost back to where it was before incorporating the new code.
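For concreteness, here is a hedged sketch (not my actual application code) of the kind of change that cuts the allocation count, in the spirit of the malloc_a_lot() kernel in the example program further down: hoist a single device-side malloc out of the hot loop instead of allocating on every iteration.
// Hypothetical kernel, for illustration only: same bookkeeping as the
// malloc_a_lot() kernel below, but with one device-side allocation per
// thread instead of one per loop iteration.
__global__ void malloc_once()
{
    long sum = 0;
    int * v = (int *) malloc(sizeof(int));   // single allocation, reused
    for (int i = 0; i < (1 << 17) * 3; ++i) {
        sum += (long) v;                      // same arithmetic as before
    }
    free(v);
    printf("sum is %li\n", sum);
}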
This led me to a related question: why didn't the malloc calls show up in the statistical profiler? Are the malloc calls some sort of GPU system call that can't be observed?
Below, I include an example program and its output that showcases this particular issue.
#include <chrono>
#include <iostream>
#include <numeric>
#include <thread>
#include <stdlib.h>
#include <stdio.h>

static void CheckCudaErrorAux (const char *, unsigned, const char *, cudaError_t);
#define CUDA_CHECK_RETURN(value) CheckCudaErrorAux(__FILE__,__LINE__, #value, value)

// Pure arithmetic loop: no device-side allocation.
__global__ void countup()
{
    long sum = 0;
    for (long i = 0; i < (1 << 23); ++i) {
        sum += i;
    }
    printf("sum is %li\n", sum);
}

// Device-side malloc/free in a tight loop.
__global__ void malloc_a_lot()
{
    long sum = 0;
    for (int i = 0; i < (1 << 17) * 3; ++i) {
        int * v = (int *) malloc(sizeof(int));
        sum += (long) v;
        free(v);
    }
    printf("sum is %li\n", sum);
}

// Both workloads in a single kernel.
__global__ void both()
{
    long sum = 0;
    for (long i = 0; i < (1 << 23); ++i) {
        sum += i;
    }
    printf("sum is %li\n", sum);

    sum = 0;
    for (int i = 0; i < (1 << 17) * 3; ++i) {
        int * v = (int *) malloc(sizeof(int));
        sum += (long) v;
        free(v);
    }
    printf("sum is %li\n", sum);
}

int main(void)
{
    CUDA_CHECK_RETURN(cudaDeviceSynchronize());
    std::chrono::time_point<std::chrono::system_clock> t1 = std::chrono::system_clock::now();
    countup<<<8,1>>>();
    CUDA_CHECK_RETURN(cudaDeviceSynchronize());
    std::chrono::time_point<std::chrono::system_clock> t2 = std::chrono::system_clock::now();
    malloc_a_lot<<<8,1>>>();
    CUDA_CHECK_RETURN(cudaDeviceSynchronize());
    std::chrono::time_point<std::chrono::system_clock> t3 = std::chrono::system_clock::now();
    both<<<8,1>>>();
    CUDA_CHECK_RETURN(cudaDeviceSynchronize());
    std::chrono::time_point<std::chrono::system_clock> t4 = std::chrono::system_clock::now();

    std::chrono::duration<double> duration_1_to_2 = t2 - t1;
    std::chrono::duration<double> duration_2_to_3 = t3 - t2;
    std::chrono::duration<double> duration_3_to_4 = t4 - t3;
    printf("timer for countup() took %.3lf\n", duration_1_to_2.count());
    printf("timer for malloc_a_lot() took %.3lf\n", duration_2_to_3.count());
    printf("timer for both() took %.3lf\n", duration_3_to_4.count());
    return 0;
}

static void CheckCudaErrorAux (const char *file, unsigned line, const char *statement, cudaError_t err)
{
    if (err == cudaSuccess)
        return;
    std::cerr << statement << " returned " << cudaGetErrorString(err) << "(" << err << ") at " << file << ":" << line << std::endl;
    exit(1);
}
An elided version of the results is:
sum is 35184367894528...
sum is -319453208467532096...
sum is 35184367894528...
sum is -319453208467332416...
timer for countup() took 4.034
timer for malloc_a_lot() took 4.306
timer for both() took 8.343
A profiling result is shown in the following graphic. The numbers that show up when mousing over the light blue bars are consistent with the sizes of the bars. Specifically, Line 41 (the sum += i; statement in both()) has 16,515,077 samples associated with it, but Line 47 (the sum += (long) v; statement) has only 633,996 samples.
BTW, the program above is compiled with debug information and presumably no optimization -- the default "Debug" mode for compiling in Nsight Eclipse. If I compile in "Release" mode, optimization is invoked, and the countup() call's duration is very close to 0 seconds.
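If you want the arithmetic loop to survive Release-mode optimization, one option (just a sketch, not part of the program above) is to make the loop bound a runtime value, such as a kernel argument, so the compiler cannot evaluate the whole loop at compile time -- though a sufficiently clever optimizer may still reduce a simple arithmetic series to a closed form:
// Hypothetical variant of countup() with a runtime loop bound, so the
// Release-mode compiler cannot precompute the sum at compile time.
// Illustrative sketch only.
__global__ void countup_runtime(long n)
{
    long sum = 0;
    for (long i = 0; i < n; ++i) {
        sum += i;
    }
    printf("sum is %li\n", sum);
}
// Launched, e.g., as: countup_runtime<<<8,1>>>(1L << 23);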
The current NVIDIA GPU PC Sampler only collects the current warp program counter (not a call stack). The PC sampler will correctly collect samples inside of malloc; however, the tool does not show SASS or high-level source for internal syscalls. If that tool limitation were fixed, the data would be shown on a separate row simply labelled "syscall" or "malloc". Because the hardware does not collect call stacks, it is not possible to attribute the samples back to the calling line (L48) in your source.
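If you need an attribution of that time anyway, one rough workaround (my own sketch, not a profiler feature) is to instrument the kernel by hand with clock64() and report how many cycles each section consumes:
// Sketch: manually time the two halves of the both() kernel with clock64(),
// so the cost of the device-side malloc/free loop can be measured even though
// the PC sampler cannot attribute its samples. Illustrative only.
__global__ void both_instrumented()
{
    long sum = 0;
    long long t0 = clock64();
    for (long i = 0; i < (1 << 23); ++i) {
        sum += i;
    }
    long long t1 = clock64();
    printf("sum is %li (arithmetic loop: %lld cycles)\n", sum, t1 - t0);

    sum = 0;
    long long t2 = clock64();
    for (int i = 0; i < (1 << 17) * 3; ++i) {
        int * v = (int *) malloc(sizeof(int));
        sum += (long) v;
        free(v);
    }
    long long t3 = clock64();
    printf("sum is %li (malloc/free loop: %lld cycles)\n", sum, t3 - t2);
}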