
Difference between Benchmarking and Profiling

I see the terms software benchmarking and profiling sometimes used interchangeably, but as far as I understand there's a subtle difference.

Both are connected by time. But whereas benchmarking is mainly about determining a certain speed score that can be compared with other applications, profiling gives you exact information about where your application spends most of its time (or number of cycles).

For me it was always like: integration testing is the counterpart to benchmarking and unit testing the counterpart to profiling. But how does micro-benchmarking fit into this?

Someone stated here:

Profiling and benchmarking are flip sides of the same coin, profiling helps you to narrow down to where optimization would be most useful, benchmarking allows you to easily isolate optimizations and cross-compare them.

Another one said here about Profiling:

Profiling means different things at different times. Sometimes it means measuring performance. Sometimes it means diagnosing memory leaks. Sometimes it means getting visibility into multi-threading or other low-level activities.

So, are those techniques conceptually different, or is it just not that black and white?


  • A benchmark is something that measures the time for some whole operation. e.g. I/O operations per second under some workload. So the result is typically a single number, in either seconds or operations per second. Or a data set with results for different parameters, so you can graph it.

    You might use a benchmark to compare the same software on different hardware, or different versions of some other software that your benchmark interacts with. e.g. benchmark max connections per second with different Apache settings.

    Profiling is not aimed at comparing different things: it's about understanding the behaviour of a program. A profile result might be a table of time taken per function, or even per instruction with a sampling profiler. You can tell it's a profile not a benchmark because it makes no sense to say "that function took the least time so we'll keep that one and stop using the rest".
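
    As a rough illustration of that difference, here is a minimal Python sketch. The functions `parse`, `transform` and `workload` are made-up stand-ins for a real program, and `cProfile` is a tracing profiler rather than the sampling kind mentioned above. The benchmark returns one number for the whole operation; the profile returns a per-function table.

```python
import cProfile
import io
import pstats
import time


def parse(n):
    # Hypothetical cheap step: build the input data.
    return [i for i in range(n)]


def transform(values):
    # Hypothetical expensive step: a sort with a Python-level key function.
    return sorted(values, key=lambda v: (v * 7919) % 10007)


def workload(n=100_000):
    # The "whole operation" that a benchmark would time.
    return transform(parse(n))


def benchmark(func, repeats=5):
    """Benchmark: a single number (best wall-clock seconds over `repeats` runs)."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)


def profile(func):
    """Profile: a text table of time per function, sorted by cumulative time."""
    profiler = cProfile.Profile()
    profiler.runcall(func)
    out = io.StringIO()
    pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(10)
    return out.getvalue()


if __name__ == "__main__":
    print(f"benchmark: {benchmark(workload):.4f} s")  # one number you can compare
    print(profile(workload))                          # a breakdown you can read
```

    The benchmark number only means something next to another benchmark number (different hardware, different settings). The profile table is read on its own, to see which function dominates.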

    Read the wikipedia article to learn more about it:

    You use a profile to figure out where to optimize. A 10% speedup in a function where your program spends 99% of its time is more valuable than a 100% speedup in any other function. Even better is when you can improve your high-level design so the expensive function is called less, as well as just making it faster.
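
    The arithmetic behind that claim can be checked with Amdahl's law. This sketch assumes that "10% speedup" means the function becomes 1.10x faster, that "100% speedup" means 2x faster, and that the remaining 1% of runtime sits in a single other function (the best case for it); `overall_speedup` is a helper name chosen here for illustration.

```python
def overall_speedup(fraction, local_speedup):
    """Amdahl's law: whole-program speedup when `fraction` of the runtime
    becomes `local_speedup` times faster and the rest is unchanged."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


# Hot function: 99% of runtime, made 1.10x faster.
hot = overall_speedup(0.99, 1.10)

# Any other function: at most 1% of runtime, made 2x faster.
cold = overall_speedup(0.01, 2.00)

# Upper bound for the other function: its 1% of runtime removed completely.
cold_limit = 1.0 / (1.0 - 0.01)

print(f"hot function 1.10x faster  -> program {hot:.3f}x faster")
print(f"cold function 2x faster    -> program {cold:.3f}x faster")
print(f"cold function removed      -> program {cold_limit:.3f}x faster")
```

    Under these assumptions the hot-function change gives roughly a 1.10x overall speedup, while nothing done to the other 1% can exceed about 1.01x.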

    Micro-benchmarking is a specific form of benchmarking. It means you're testing one very specific thing to measure just that in isolation, not the overall performance of anything that's really useful.

    Example microbenchmark results:

    Example non-micro benchmark results:

    • compressing this 100MB collection of files took 23 seconds with 7-zip (with specific options and hardware).
    • compiling a Linux kernel took 99 seconds on some hardware / software combination.

    See also

    Micro-benchmarking is a special case of benchmarking. If you do it right, it tells you which operations are expensive and which are cheap, which helps you while trying to optimize. If you do it wrong, you probably didn't even measure what you set out to measure at all. e.g. you wrote some C to test for loops vs. while loops, but the compiler made different code for different reasons, and your results are meaningless. (Different ways to express the same logic almost never matter with modern optimizing compilers; don't waste time on this.) Micro-benchmarking is hard.

    The other way to tell it's a micro-benchmark is that you usually need to look at the compiler's asm output to make sure it's testing what you wanted it to test. (e.g. that it didn't optimize across iterations of your repeat-10M-times loop by hoisting something expensive out of the loop. That loop is supposed to repeat the whole operation enough times to give a duration that can be accurately measured.)
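
    The same trap exists outside C. As a loose Python analogue (not the C/asm case described above), CPython folds constant expressions at compile time, so a micro-benchmark of `2 ** 10` may time loading a constant rather than doing an exponentiation. Looking at the bytecode with `dis` plays the role of reading the compiler's asm output. Exactly which expressions get folded depends on the CPython version, so treat this as a sketch.

```python
import dis
import timeit

# What we think we are benchmarking: an exponentiation.
folded = compile("2 ** 10", "<bench>", "eval")

# Same operation, but the base is a variable the compiler cannot see through.
not_folded = compile("base ** 10", "<bench>", "eval")

# If 1024 shows up among the constants, the work was done at compile time.
print("constants in '2 ** 10':   ", folded.co_consts)
print("constants in 'base ** 10':", not_folded.co_consts)

# Inspect the bytecode, the Python counterpart of checking the asm.
dis.dis(folded)
dis.dis(not_folded)

# Time both; only the second one actually computes a power on each iteration.
t_folded = timeit.timeit("2 ** 10", number=1_000_000)
t_real = timeit.timeit("base ** 10", setup="base = 2", number=1_000_000)
print(f"'2 ** 10':    {t_folded:.4f} s per 1M runs")
print(f"'base ** 10': {t_real:.4f} s per 1M runs")
```

    If the two timings differ, the first one was not measuring what it set out to measure, which is the point made above.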

    Micro-benchmarks can distort things, because they test your function with caches hot and branch predictors primed, and they don't run any other code between invocations of the code under test. This can make huge loop unrolling look good, when as part of a real program it would lead to more cache misses. Similarly, it makes big lookup tables look good, because the whole lookup table ends up in cache. The full program usually dirties enough cache between calls to the function that the lookup table doesn't always hit in cache, so it would have been cheaper just to compute something. (Most programs are memory-bound. Re-computing something not too complex is often as fast as looking it up.)