
How does `dtrace` probe memory allocations (Mac OS)


Does anyone know which function or mechanism dtrace uses to track mallocs? I'm trying to profile a piece of code, which I can do with the aid of a debugger and some command-line scripting, e.g.:

sudo dtrace -n "pid`pgrep Mail | head -n 1`::malloc:entry { @sizes=quantize(arg0); }"

Gives me something like:

dtrace: description 'pid31411::malloc:entry ' matched 4 probes
^C

       value  ------------- Distribution ------------- count    
          -1 |                                         0        
           0 |                                         214      
           1 |                                         7        
           2 |                                         191      
           4 |                                         1054     
           8 |@@@@                                     15992    
          16 |@@@@@@@@@@@@                             44569    
          32 |@@@@@@@@@@                               37003    
          64 |@@@@                                     15426    
         128 |@@@@                                     15695    
         256 |@                                        2616     
         512 |@                                        1967     
        1024 |@                                        1891     
        2048 |@@                                       6010     
        4096 |                                         523      
        8192 |                                         43       
       16384 |                                         110      
       32768 |                                         19       
       65536 |                                         0        
      131072 |                                         69       
      262144 |                                         0        

But this is really tedious for me. I was wondering how to do this programmatically, from within the code.


Solution

  • I think you're viewing the problem the wrong way around. Your example shows a fairly sophisticated interpretation of an arbitrary argument in an arbitrary combination of process and function — being able to do that in a single line and without modifying your own program is extraordinarily powerful. Attempting to have your own code perform the same analysis makes no sense: what would you do if, e.g., you wanted a linear scale instead of a logarithmic one? Reimplement lquantize(), too?

    Focus on writing the code you want and let DTrace do the profiling.

    EDIT in response to the first comment.

    The execution path for the example you give is extremely circuitous. Very broadly, dtrace(1) requests that the kernel modify malloc's prologue so that, on entry, a calling thread traps to the DTrace kernel module. There, the datum is aggregated within a per-CPU buffer before control is returned to the instrumented thread. At periodic intervals, the dtrace process uses libdtrace to request a snapshot of the kernel's per-CPU buffers via ioctl(2). Coalescing these buffers and then rendering the graph that you see are also functions performed by libdtrace. On macOS, the libdtrace API, which includes the format of the records exchanged with the kernel, is private. Thus, reusing any of this infrastructure for even your simple example would be "using a sledgehammer to crack a nut".
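
    To make the shape of that pipeline concrete, here is a toy model in C. It is only a sketch of the idea (per-CPU buckets, a later merge, then rendering) and not DTrace's real data structures or record format, which are private. Every name in it (NCPUS, percpu, probe_fire, coalesce, render) is made up for illustration.

```c
#include <limits.h>
#include <stddef.h>
#include <stdio.h>

#define NCPUS    4   /* hypothetical CPU count for the toy model */
#define NBUCKETS (sizeof(size_t) * CHAR_BIT + 1)

/* One private histogram per CPU, standing in for a per-CPU kernel buffer. */
static unsigned long percpu[NCPUS][NBUCKETS];

/* Power-of-two bucket: 0 for n == 0, otherwise floor(log2(n)) + 1. */
static unsigned bucket_index(size_t n)
{
    unsigned idx = 0;
    while (n != 0) {
        idx++;
        n >>= 1;
    }
    return idx;
}

/* "Probe fires": record the datum in the buffer of the CPU it fired on. */
static void probe_fire(unsigned cpu, size_t arg0)
{
    percpu[cpu % NCPUS][bucket_index(arg0)]++;
}

/* "Consumer snapshot": sum the per-CPU buffers into a single histogram. */
static void coalesce(unsigned long total[NBUCKETS])
{
    for (size_t b = 0; b < NBUCKETS; b++) {
        total[b] = 0;
        for (unsigned c = 0; c < NCPUS; c++)
            total[b] += percpu[c][b];
    }
}

/* Print the non-empty buckets, labelled by each bucket's lower bound. */
static void render(const unsigned long total[NBUCKETS])
{
    for (size_t b = 0; b < NBUCKETS; b++) {
        if (total[b] == 0)
            continue;
        size_t low = (b == 0) ? 0 : (size_t)1 << (b - 1);
        printf("%12zu | %lu\n", low, total[b]);
    }
}
```

    Firing the probe for sizes 24, 24 and 4096 on different CPUs, then calling coalesce() and render(), gives one row for 16 with a count of 2 and one row for 4096 with a count of 1. Because each CPU only ever touches its own buffer, the hot path needs no locking; the merge cost is paid later by the consumer.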

    A further consideration is that you'll be adding code that will itself need to be debugged and maintained. If your code is sufficiently complex that it warrants its own instrumentation then it seems plausible that, one day, you will want to consider calloc(), realloc() and mmap(). Perhaps you will also want to explicitly include or exclude calls to these functions from not just your own code but other libraries against which it is linked.

    Finally, it will almost always be preferable to separate the code that implements your actual task from the code used to debug it. One example approach would be to write your own, instrumented wrapper for malloc() and put it in a shared object that you can interpose between your executable and, presumably, libc.
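
    A minimal sketch of such a wrapper, assuming clang or gcc and C11 atomics. The names traced_malloc, counts and dump_counts are hypothetical. The __DATA,__interpose section is the mechanism dyld uses for interposing on macOS; on other platforms that part is compiled out and you would need a different hook.

```c
#include <limits.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

#define NBUCKETS (sizeof(size_t) * CHAR_BIT + 1)

/* One atomic counter per power-of-two bucket, so threads can share them. */
static atomic_ulong counts[NBUCKETS];

/* Power-of-two bucket: 0 for n == 0, otherwise floor(log2(n)) + 1. */
static unsigned bucket_index(size_t n)
{
    unsigned idx = 0;
    while (n != 0) {
        idx++;
        n >>= 1;
    }
    return idx;
}

/* Hypothetical wrapper: count the request, then defer to the real malloc(). */
void *traced_malloc(size_t size)
{
    atomic_fetch_add(&counts[bucket_index(size)], 1);
    return malloc(size);
}

/* Dump the non-empty buckets to stderr when the library is unloaded. */
__attribute__((destructor))
static void dump_counts(void)
{
    for (size_t b = 0; b < NBUCKETS; b++) {
        unsigned long n = atomic_load(&counts[b]);
        if (n == 0)
            continue;
        size_t low = (b == 0) ? 0 : (size_t)1 << (b - 1);
        fprintf(stderr, "%12zu | %lu\n", low, n);
    }
}

#ifdef __APPLE__
/* dyld reads (replacement, replacee) pairs from this section at load time. */
__attribute__((used, section("__DATA,__interpose")))
static const struct {
    const void *replacement;
    const void *replacee;
} interpose_malloc = {
    (const void *)(unsigned long)&traced_malloc,
    (const void *)(unsigned long)&malloc
};
#endif
```

    On macOS you would build this as a dynamic library (for example, clang -dynamiclib -o libtrace.dylib trace.c, where both file names are placeholders) and load it with DYLD_INSERT_LIBRARIES=./libtrace.dylib ./your_program. Note that protected system binaries ignore DYLD_INSERT_LIBRARIES, so this works for your own executables rather than for something like Mail. The program itself stays untouched, which is the point: the instrumentation lives in a separate object that you can drop at any time.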