performancebit-manipulationbranch-predictionbranchless# How can I make branchless code and how does it work?

Related to this answer.

In the above answer, it's mentioned how you can avoid branch prediction fails by avoiding branches.

The user demonstrates this by replacing:

```
if (data[c] >= 128)
{
sum += data[c];
}
```

With:

```
int t = (data[c] - 128) >> 31;
sum += ~t & data[c];
```

How are these two equivalent (for the specific data set, not strictly equivalent)?

What are some general ways I can do similar things in similar situations? Would it always be by using `>>`

and `~`

?

Solution

```
int t = (data[c] - 128) >> 31;
```

The trick here is that if `data[c] >= 128`

, then `data[c] - 128`

is nonnegative, otherwise it is negative. The highest bit in an `int`

, the sign bit, is 1 if and only if that number is negative. `>>`

is a shift that extends the sign bit, so shifting right by 31 makes the *whole* result 0 if it used to be nonnegative, and all 1 bits (which represents -1) if it used to be negative. So `t`

is `0`

if `data[c] >= 128`

, and `-1`

otherwise. `~t`

switches these possibilities, so `~t`

is `-1`

if `data[c] >= 128`

, and `0`

otherwise.

`x & (-1)`

is always equal to `x`

, and `x & 0`

is always equal to `0`

. So `sum += ~t & data[c]`

increases `sum`

by `0`

if `data[c] < 128`

, and by `data[c]`

otherwise.

Many of these tricks can be applied elsewhere. This trick can certainly be generally applied to have a number be `0`

if and only if one value is greater than or equal to another value, and `-1`

otherwise, and you can mess with it some more to get `<=`

, `<`

, and so on. Bit twiddling like this is a common approach to making mathematical operations branch-free, though it's certainly not always going to be built out of the same operations; `^`

(xor) and `|`

(or) also come into play sometimes.

- Performance: XmlReader or LINQ to XML
- using Numba for scipy fsolve, but get error
- Performance degradation in AKS Windows node pool vs. virtual machine-hostes application in IIS
- Is f(int const) better than f(int) for compiler optimization?
- Collections and the Powershell Pipeline
- When should i use async/await in Python?
- Performance impact due to store.Open and store.Certificates.Find methods in X509Store
- JavaScript Set vs. Array performance
- Measure the time it takes to execute a t-sql query
- Why is windows so slow in opening files first time and is there a faster way
- How do I profile a Python script?
- What is the fastest way to decide if a digit appears in a string?
- Compare List of objects in java script
- How can I efficiently execute multiple C++ benchmarking algorithms using Windows cmd
- Firebase Firestore Query Performance
- Efficiency of while(true) ServerSocket Listen
- OpenGL core profile incredible slowdown on OS X
- Is there a container similar to `std::deque` but with custom block size and better performance?
- minimize() - iteration time keeps increasing
- StringBuffer and String usage scenario in Java
- How can I accurately benchmark unaligned access speed on x86_64?
- How to analyse apache log on 1 server with 4 cores faster
- Query Execution time in Management Studio & profiler. What does it measure?
- In Julia, how to convert a unsigned number to a signed number like in C?
- harvesine vectorization for vector list
- BIGSERIAL vs SERIAL in PostgreSQL
- Should using "any" be avoided when optimising code in Swift?
- Compute the max sum circular area
- Selecting the maximum number of values from a column that share values in another column in R
- Google Sheets Query to extract data and replace checkboxes with values. (Diff)