Tags: x86, intel, trace, branch-prediction, intel-pmu

What is the overhead of using Intel Last Branch Record?


Last Branch Record refers to a collection of register pairs (MSRs) that store the source and destination addresses of recently executed branches. The Intel SDM Vol. 3B (http://css.csail.mit.edu/6.858/2012/readings/ia32/ia32-3b.pdf) has more information in case you are interested.

  • a) Can someone give an idea of how much LBR tracing slows down the execution of common programs, both CPU- and IO-intensive?
  • b) Is branch prediction turned OFF while LBR tracing is ON?

Solution

  • The paper Intel Code Execution Trace Resources (by Craig Pedersen and Jeff Acampora of Arium, April 29, 2012) lists three variants of branch tracing:

    • Last Branch Record (LBR) flag in the DebugCtlMSR and corresponding LastBranchToIP and LastBranchFromIP MSRs as well as LastExceptionToIP and LastExceptionFromIP MSRs.

    • Branch Trace Store (BTS) using either cache-as-RAM or system DRAM.

    • Architecture Event Trace (AET) captured off the XDP port and stored externally in a connected In-Target Probe.

    As stated on page 2, LBR saves its information in MSRs and "does not impede any real-time performance," but it is useful only for very short stretches of code ("effective trace display is very shallow and typically may only show hundreds of instructions"), since it only keeps information about the last 4-16 branches. A sketch of reading these MSRs directly is shown below.
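    For the curious, here is a minimal sketch of poking at LBR directly from user space on Linux through the msr driver (requires root and modprobe msr). The MSR addresses and the 16-entry ring depth below are assumptions taken from the SDM tables for Nehalem-and-later cores, not universal constants, and the ring will contain whatever happened to run on that CPU; in practice you would let a tool such as perf drive this hardware for you.

    ```c
    /*
     * Minimal sketch: enable LBR and dump the last-branch MSRs through the
     * Linux msr driver. MSR addresses and the 16-entry depth are
     * model-specific assumptions (Nehalem and later); check SDM Vol. 3B
     * for your CPU. Run as root on CPU 0 after `modprobe msr`.
     */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    #define IA32_DEBUGCTL         0x1D9  /* bit 0 = LBR enable flag */
    #define MSR_LASTBRANCH_TOS    0x1C9  /* top-of-stack index of the ring */
    #define MSR_LASTBRANCH_0_FROM 0x680  /* base of the FROM_IP ring */
    #define MSR_LASTBRANCH_0_TO   0x6C0  /* base of the TO_IP ring */

    static uint64_t rdmsr(int fd, uint32_t msr)
    {
        uint64_t v = 0;
        pread(fd, &v, sizeof v, msr);   /* file offset selects the MSR */
        return v;
    }

    int main(void)
    {
        int fd = open("/dev/cpu/0/msr", O_RDWR);
        if (fd < 0) { perror("open /dev/cpu/0/msr"); return 1; }

        /* Set DEBUGCTL.LBR; the CPU starts logging branch source/target
         * pairs (for whatever runs on CPU 0) into the LBR ring. */
        uint64_t ctl = rdmsr(fd, IA32_DEBUGCTL) | 1;
        pwrite(fd, &ctl, sizeof ctl, IA32_DEBUGCTL);

        uint64_t tos = rdmsr(fd, MSR_LASTBRANCH_TOS) & 0xF; /* 16 entries assumed */
        for (int i = 0; i < 16; i++)
            printf("[%2d]%c %#llx -> %#llx\n", i, i == (int)tos ? '*' : ' ',
                   (unsigned long long)rdmsr(fd, MSR_LASTBRANCH_0_FROM + i),
                   (unsigned long long)rdmsr(fd, MSR_LASTBRANCH_0_TO + i));
        close(fd);
        return 0;
    }
    ```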

    BTS allows capturing many pairs of branch "From"s and "To"s and storing them either in cache (cache-as-RAM, CAR) or in system DRAM. With CAR, the trace depth/length is limited by the cache size (and some constant); with DRAM, the trace length is almost unlimited. The paper estimates the overhead of BTS at 20 to 100 percent, due to the additional memory stores. BTS on Linux is easy to use with the proposed perf branch record patches (not yet in vanilla at the time of writing) or with the btrax project. The perf branch presentation gives some hints about how BTS is organized: there is a BTS buffer whose records contain "from" and "to" fields plus a "predicted" flag, so branch prediction is not turned off while BTS is in use. When the BTS buffer fills up to its maximum size, an interrupt is generated, and the BTS-handling module in the kernel (the perf_events subsystem or the btrax kernel module) must then copy the data out of the BTS buffer to another location.

    So, in BTS mode there are two sources of overhead: the cache/memory stores themselves and the interrupts raised by BTS buffer overflows. A minimal sketch of requesting such branch records through perf_events follows.
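    As a concrete illustration of that interface, below is a minimal sketch using the perf_events branch-sampling API (PERF_SAMPLE_BRANCH_STACK, merged into the vanilla kernel in 3.4, and what perf record -b uses underneath). Each sample carries the from/to/predicted triples described above; on Intel CPUs the kernel typically services this from the LBR facility. The sampled event, the period, and the omitted ring-buffer reading are illustrative choices, not requirements.

    ```c
    /*
     * Minimal sketch: ask perf_events for hardware branch records
     * (from/to/predicted) on the calling thread. Sample consumption via
     * the mmap'd ring buffer is omitted; see perf_event_open(2).
     */
    #define _GNU_SOURCE
    #include <linux/perf_event.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof attr);
        attr.size = sizeof attr;
        attr.type = PERF_TYPE_HARDWARE;
        attr.config = PERF_COUNT_HW_CPU_CYCLES;
        attr.sample_period = 100000;               /* sample every 100k cycles */
        attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_BRANCH_STACK;
        attr.branch_sample_type = PERF_SAMPLE_BRANCH_ANY
                                | PERF_SAMPLE_BRANCH_USER;
        attr.disabled = 1;
        attr.exclude_kernel = 1;

        /* pid = 0 (this thread), cpu = -1 (any CPU) */
        int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
        if (fd < 0) { perror("perf_event_open"); return 1; }

        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
        volatile unsigned n = 0;
        for (unsigned i = 0; i < 10000000; i++)    /* generate some branches */
            if (i % 3) n++;
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

        puts("done; branch samples sit in the event's mmap ring buffer");
        close(fd);
        return 0;
    }
    ```

    The command-line equivalent is simply perf record -b ./your_program followed by perf report.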

    AET uses an external agent to save debug and trace data. This agent is connected via the eXtended Debug Port (XDP) and interfaces with an In-Target Probe (ITP). According to the paper, the overhead of AET "can have a significant effect on system performance, which can be several orders of magnitude greater," because AET can generate/capture more types of events. The collected trace data, however, is stored externally to the debugged platform.

    The paper's "Summary" says:

    LBR has no overhead, but is very shallow (4–16 branch locations, depending on the CPU). Trace data is available immediately out of reset.

    BTS is much deeper, but has an impact on CPU performance and requires on-board RAM. Trace data is available as soon as CAR is initialized.

    AET requires special ITP hardware and is not available on all CPU architectures. It has the advantage of storing the trace data off board.