Why is SSE alignment necessary when using SIMD instructions?

I am new to C++; I only have 1.5 years of experience with the language.

I have to work with a library that has premade data structures, and it offers a way to define our own data structure, following certain rules, so that it works with the library.

This is the PCL library. The data structure I am talking about is the Point Type.

One of these "rules" is to SSE-align the data of the point type to 16 bytes (I think it is 16 bytes), but I don't understand why.

I have to use weird unions and structures to achieve it. Why can't I just write a simple structure and put every float I need in it?

I saw that SSE alignment is strongly recommended for SIMD instructions, and I suspect the PCL library uses them. Are SIMD instructions useful?

Solution

SIMD means "single instruction multiple data".

Modern computers have a number of ways to do more than one thing at once. There are physics limitations that make building computers that run much faster than 5 GHz difficult. So modern computers have instead gotten better at doing more than one thing at a time, rather than running one set of instructions faster.

To harness that, we need to do more than one thing at a time in our computer programs.

One way to do more than one thing at once is with multiple processes -- programs -- running at once.

Another is with threads within the program, where each thread has its own instructions and data.

CPU pipelining of instructions happens in a single thread. In it, some of the work required for each instruction is done in overlapping ways. Depending on the architecture, the machine code may or may not have to know about these delays; in x64 AMD/Intel, the CPU typically "stalls" when an instruction needs a result that an earlier instruction has not finished producing. Compilers attempt to avoid such stalls.

SIMD is another way to do more than one thing at once. It is also called vectorization. SIMD has the same instruction running on multiple pieces of data. So if you have a bunch of mathematical vectors (each with multiple components: say, x,y,z,w) you want to add up piece-wise, a single SIMD instruction can add the xs, ys, zs and ws separately all at the same time.
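To make that concrete, here is a minimal sketch of adding two 4-component vectors both ways. It assumes an x86 target with SSE and uses the standard intrinsics from `<xmmintrin.h>`; the function names `add4_scalar`, `add4_simd` and `demo_add4` are made up for this example. It deliberately uses the unaligned load/store intrinsics so that it works on any `float` array.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps
#define EXAMPLE_HAS_SSE 1
#endif

// Scalar version: four separate additions, one component at a time.
inline void add4_scalar(const float* a, const float* b, float* out) {
    for (std::size_t i = 0; i < 4; ++i) {
        out[i] = a[i] + b[i];
    }
}

// SIMD version: a single add instruction handles x, y, z and w together.
inline void add4_simd(const float* a, const float* b, float* out) {
#ifdef EXAMPLE_HAS_SSE
    __m128 va = _mm_loadu_ps(a);             // load 4 floats (no alignment required)
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));  // 4 additions in one instruction
#else
    add4_scalar(a, b, out);                  // fallback when SSE is not available
#endif
}

// Usage: both versions give the same result.
inline void demo_add4() {
    const float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    const float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    float scalar[4];
    float simd[4];
    add4_scalar(a, b, scalar);
    add4_simd(a, b, simd);
    for (std::size_t i = 0; i < 4; ++i) {
        assert(scalar[i] == simd[i]);
    }
}
```

In practice an optimizing compiler will often generate the SIMD form from the plain loop on its own; writing the intrinsics by hand just makes it visible.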

SIMD instructions often require that your data be aligned in a certain way in memory. For a 128-bit SIMD instruction on four 32-bit integers, the address used usually has to be a multiple of 128 bits (16 bytes); that is, the lowest 4 bits of the address must be 0.
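This is also why a simple structure of three floats is not enough. The sketch below is only an illustration, not PCL's actual point type or macros: `PlainPoint`, `AlignedPoint`, `is_aligned_16` and `aligned_floats` are hypothetical names. A plain `x, y, z` struct is 12 bytes, so consecutive array elements cannot all start on a 16-byte boundary; padding to four floats and requesting 16-byte alignment fixes that.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// A "simple" point: 12 bytes, aligned only to 4 bytes.
struct PlainPoint {
    float x, y, z;
};

// Hypothetical padded point: data[0..2] hold x, y, z; data[3] is padding.
// alignas(16) makes every instance start at a multiple of 16 bytes.
struct alignas(16) AlignedPoint {
    float data[4];
};

static_assert(sizeof(PlainPoint) == 12, "three 4-byte floats, no padding");
static_assert(sizeof(AlignedPoint) == 16, "exactly one 128-bit register");
static_assert(alignof(AlignedPoint) == 16, "16-byte aligned");

// True when the lowest 4 bits of the address are 0.
inline bool is_aligned_16(const void* p) {
    return (reinterpret_cast<std::uintptr_t>(p) & 0xF) == 0;
}

// Returns a pointer to the i-th point's floats, suitable for an aligned
// 128-bit load; the assert documents the guarantee alignas(16) provides.
inline const float* aligned_floats(const AlignedPoint* pts, std::size_t i) {
    const float* p = pts[i].data;
    assert(is_aligned_16(p));
    return p;
}
```

With `PlainPoint`, element 1 of an array sits 12 bytes after element 0, so at most one of the two can be 16-byte aligned. With `AlignedPoint`, every element is.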

SIMD instructions are best used on large buffers of data, where the same operation is repeated many times and the pipeline stays full. The cost of aligning your data is then low, and the benefit in the CPU is high.

In some architectures, even non-SIMD data needs to be aligned, and aligned data is often faster to read.

SIMD instructions can be many times faster than the naive approach. Modern SIMD instructions are sometimes 512 bits wide, and run at close to the speed of a single instruction on a single 8-, 16- or 32-bit value, so they could make a program 10x faster; this SO blog post has one example of a more than 10x speedup.

Of course, that is an ideal situation. Often the boost is smaller, but even a 2x speedup can be significant to the user's experience.

(Aside: The above contains some simplifying "lies to children": mostly true, but the details are not exact. The same goes for almost any discussion of these subjects short of an up-to-date university course or similar.)