
Difference between Benchmarking and Profiling


I see the terms software benchmarking and profiling sometimes used interchangeably, but as far as I understand there's a subtle difference.

Both are connected by time. But whereas benchmarking is mainly about determining a certain speed score that can be compared with other applications, profiling gives you exact information about where your application spends most of its time (or number of cycles).

For me it was always like: integration testing is the counterpart to benchmarking and unit testing the counterpart to profiling. But how does micro-benchmarking fit in?

Someone stated here:

Profiling and benchmarking are flip sides of the same coin: profiling helps you narrow down where optimization would be most useful, while benchmarking allows you to easily isolate optimizations and cross-compare them.

Another one said here about profiling:

Profiling means different things at different times. Sometimes it means measuring performance. Sometimes it means diagnosing memory leaks. Sometimes it means getting visibility into multi-threading or other low-level activities.

So, are those techniques conceptually different, or is it just not that black and white?


Solution

  • A benchmark is something that measures the time for some whole operation. e.g. I/O operations per second under some workload. So the result is typically a single number, in either seconds or operations per second. Or a data set with results for different parameters, so you can graph it.

    You might use a benchmark to compare the same software on different hardware, or different versions of some other software that your benchmark interacts with. e.g. benchmark max connections per second with different apache settings.
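A benchmark in this sense can be sketched in a few lines: time one whole operation end to end and report a single number. The workload below (sorting a million random floats) is an illustrative stand-in, not a benchmark anyone actually publishes.

```python
# Minimal benchmark sketch: measure a whole operation and report
# one number, suitable for comparing across hardware or versions.
import random
import time

def benchmark_sort(n=1_000_000):
    data = [random.random() for _ in range(n)]
    start = time.perf_counter()   # monotonic, high-resolution clock
    sorted(data)                  # the whole operation being measured
    return time.perf_counter() - start

elapsed = benchmark_sort()
print(f"sorted 1M floats in {elapsed:.3f} s")
```

Running this on two machines (or two interpreter versions) and comparing the numbers is exactly the cross-comparison use case described above.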


    Profiling is not aimed at comparing different things: it's about understanding the behaviour of a program. A profile result might be a table of time taken per function, or even per instruction with a sampling profiler. You can tell it's a profile not a benchmark because it makes no sense to say "that function took the least time so we'll keep that one and stop using the rest".

    Read the Wikipedia article to learn more about it: https://en.wikipedia.org/wiki/Profiling_(computer_programming)

    You use a profile to figure out where to optimize. A 10% speedup in a function where your program spends 99% of its time is more valuable than a 100% speedup in any other function. Even better is when you can improve your high-level design so the expensive function is called less, as well as just making it faster.
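In contrast to the single number a benchmark produces, a profile is a table. Here is a sketch using Python's standard-library cProfile; the workload functions are made up purely to show the shape of the output.

```python
# Profiling sketch with cProfile: the result is a per-function table
# (calls, tottime, cumtime), not a single score. That table tells you
# where the program spends its time, i.e. where to optimize.
import cProfile
import io
import pstats

def cheap():
    return sum(range(100))

def expensive():
    return sum(i * i for i in range(200_000))

def program():
    for _ in range(20):
        cheap()
    expensive()                    # dominates the runtime

profiler = cProfile.Profile()
profiler.enable()
program()
profiler.disable()

out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
print(out.getvalue())
```

The output shows `expensive` near the top of the cumulative-time column, which is the signal that optimizing it (or calling it less) would pay off, while shaving time off `cheap` would not.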


    Microbenchmarking is a specific form of benchmarking. It means you're testing one super-specific thing to measure just that in isolation, not the overall performance of anything that's really useful.

    Example microbenchmark results:

    • a single integer add has 1 cycle of latency on typical modern x86 CPUs.
    • an L1d cache hit costs about 4 to 5 cycles of load-use latency.

    Example non-micro benchmark results:

    • compressing this 100MB collection of files took 23 seconds with 7-zip (with specific options and hardware).
    • compiling a Linux kernel took 99 seconds on some hardware / software combination.

    See also https://en.wikipedia.org/wiki/Benchmark_(computing)#Types_of_benchmarks.

    Micro-benchmarking is a special case of benchmarking. If you do it right, it tells you which operations are expensive and which are cheap, which helps you while trying to optimize. If you do it wrong, you probably didn't even measure what you set out to measure at all. e.g. you wrote some C to test for loops vs. while loops, but the compiler made different code for different reasons, and your results are meaningless. (Different ways to express the same logic almost never matter with modern optimizing compilers; don't waste time on this.) Micro-benchmarking is hard.
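As a sketch of what a micro-benchmark looks like in practice, Python's stdlib `timeit` repeats one tiny operation many times and reports the total. The two operations compared here (a dict lookup vs. a list index) are illustrative; all the caveats above still apply, since caches are hot and nothing else runs between iterations.

```python
# Micro-benchmark sketch with timeit: isolate one super-specific
# operation and repeat it enough times to get a measurable duration.
import timeit

dict_s = timeit.timeit("d[500]",
                       setup="d = {i: i for i in range(1000)}",
                       number=1_000_000)
list_s = timeit.timeit("l[500]",
                       setup="l = list(range(1000))",
                       number=1_000_000)
print(f"dict lookup: {dict_s:.3f} s per 1M ops")
print(f"list index:  {list_s:.3f} s per 1M ops")
```

Note that `timeit` deliberately disables garbage collection and runs the setup outside the timed region, which is the kind of careful isolation (and the kind of unrealistic steady-state) that defines a micro-benchmark.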

    The other way to tell it's a micro-benchmark is that you usually need to look at the compiler's asm output to make sure it's testing what you wanted it to test. (e.g. that it didn't optimize across iterations of your repeat-10M-times loop by hoisting something expensive out of the loop that's supposed to repeat the whole operation enough times to give a duration that can be measured accurately.)

    Micro-benchmarks can distort things, because they test your function with caches hot and branch predictors primed, and they don't run any other code between invocations of the code under test. This can make huge loop unrolling look good, when as part of a real program it would lead to more cache misses. Similarly, it makes big lookup tables look good, because the whole table ends up in cache. The full program usually dirties enough cache between calls to the function that the lookup table doesn't always hit in cache, so it would have been cheaper just to compute something. (Most programs are memory-bound. Re-computing something not too complex is often as fast as looking it up.)
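The lookup-table vs. recompute trade-off above can be made concrete with a small sketch. The table size and the function being tabulated are illustrative assumptions; the point is that both versions compute the same answers, and only their interaction with the cache differs.

```python
# Sketch of the lookup-table vs. recompute trade-off. In a
# micro-benchmark TABLE stays hot in cache and looks fast; in a real
# program it may be evicted between calls, making recomputation the
# cheaper option.
TABLE = [x * x % 251 for x in range(256)]  # precomputed answers

def via_table(x):
    return TABLE[x & 0xFF]        # one memory load; fast only if cached

def via_compute(x):
    return (x & 0xFF) ** 2 % 251  # a few ALU ops; no memory traffic

# Both give identical results; which one wins depends on cache state,
# which a micro-benchmark alone cannot show you.
assert all(via_table(x) == via_compute(x) for x in range(1000))
```

This is why the answer recommends judging such choices inside the full program (or at least with a realistic cache footprint) rather than from micro-benchmark numbers alone.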