# How would I define the __m256i data type in Ada?

I am trying to write a library for AVX2 in Ada 2012 using the GNAT GCC compiler. I have currently defined a data type `Vector_256_Integer_32` like so:

```
type Vector_256_Integer_32 is array (0 .. 7) of Integer_32;
pragma Pack (Vector_256_Integer_32);
```

Note that I have aligned the array to the 32-byte boundary indicated in Intel's documentation of the `_mm256_load_si256` intrinsic function from `immintrin.h`.

I would like to implement an operation that adds two of these arrays together using AVX2. The function prototype is as follows.

```
function Vector_256_Integer_32_Add (Left, Right : Vector_256_Integer_32) return Vector_256_Integer_32;
```

My idea for implementing this function is to do this in three steps.

- Load `Left` and `Right` into local variables using `_mm256_load_si256`.
- Perform the addition using `_mm256_add_epi32`.
- Convert the result back into the `Vector_256_Integer_32` type using `_mm256_store_si256`.

Where I am confused is how I would create the __m256i data type in Ada to hold the intermediate results. Can someone please shed some light on this? Additionally, if you see any issues with my approach, any feedback is appreciated.

I have found the definition of __m256i in GCC (located at gcc/gcc/config/i386/avxintrin.h).

```
typedef long long __m256i __attribute__ ((__vector_size__ (32), __may_alias__));
```

However, this is where I am stuck, as I am not sure how to translate it into Ada code. I have found that the `__vector_size__` attribute is documented in the GCC documentation.
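For context, here is a minimal C sketch of what a `vector_size (32)` type looks like on the C side, assuming the GCC/Clang vector extensions. The names `v8si`, `v8si_add`, and `v8si_demo_lane` are hypothetical and chosen only for illustration. The sketch uses `int32_t` lanes instead of the `long long` lanes of `__m256i`, since that matches the array of eight `Integer_32` values above.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdint.h>

/* Hypothetical name: eight 32-bit lanes packed into one 32-byte vector.
   vector_size gives the total size in bytes, not the number of lanes. */
typedef int32_t v8si __attribute__ ((vector_size (32)));

static_assert (sizeof (v8si) == 32, "v8si should occupy 256 bits");

/* Lane-by-lane addition using the extension's built-in + operator.
   Operands are passed by pointer so that no vector crosses a function
   boundary by value. */
static void v8si_add (const v8si *left, const v8si *right, v8si *out)
{
    *out = *left + *right;
}

/* Fills two vectors with sample data and returns one lane of their sum. */
static int32_t v8si_demo_lane (int lane)
{
    v8si a, b, r;

    for (int i = 0; i < 8; ++i) {
        a[i] = 5 * (i + 5);     /* vector lanes can be subscripted like arrays */
        b[i] = 12 * (i + 12);
    }
    v8si_add (&a, &b, &r);
    return r[lane];
}
```

When AVX2 code generation is enabled, the compiler would normally be expected to turn the `+` into a single packed add. Without it, the same source should still compile, with the operation split into narrower pieces.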

Solution

I figured out the answer to my question after doing more research. Thank you for your input. I am posting this so hopefully someone else can get value from this.

Edit: I have adjusted my answer according to feedback from the commenter Peter Cordes.

For example, to define a data type of eight 32-bit signed integers, you would write

```
type Vector_256_Integer_32 is array (0 .. 7) of Integer_32 with Convention => C, Alignment => 32;
```

The function to add the two vectors together would be defined as

```
function "+" (Left, Right: Vector_256_Integer_32) return Vector_256_Integer_32;
pragma Import (Intrinsic, "+", "__builtin_ia32_paddd256");
```

Note that I am using the GCC built-in rather than the intrinsics from `immintrin.h`, because I do not know how to import an intrinsic from that header file.

The documentation of `_mm256_add_epi32` states that the `vpaddd` instruction is used. The GCC `__builtin_ia32_paddd256` appears to translate to this instruction.
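One detail worth keeping in mind is that `vpaddd` wraps around on overflow in each 32-bit lane, whereas a checked Ada addition on `Integer_32` may raise `Constraint_Error`. The following is a scalar C model of that per-lane behavior, intended only as a reference to compare results against. The helper names `wrapping_add_i32` and `paddd256_reference` are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: one lane of a wrapping 32-bit add. */
static int32_t wrapping_add_i32 (int32_t x, int32_t y)
{
    /* Unsigned arithmetic wraps modulo 2^32, which avoids the undefined
       behavior of signed overflow in C. */
    uint32_t sum = (uint32_t) x + (uint32_t) y;
    int32_t result;

    memcpy (&result, &sum, sizeof result);   /* reinterpret the bits as signed */
    return result;
}

/* Scalar reference for an 8-lane packed add; `out` can be compared
   element by element against the result of the SIMD version. */
static void paddd256_reference (const int32_t left[8], const int32_t right[8],
                                int32_t out[8])
{
    for (int i = 0; i < 8; ++i) {
        out[i] = wrapping_add_i32 (left[i], right[i]);
    }
}
```

With inputs such as `5 * (i + 5)` and `12 * (i + 12)` no lane overflows, so the wrapping and checked additions agree. The difference only shows up for sums outside the `Integer_32` range.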

Below is an example Ada program and ads file.

avx2.ads

```
with Interfaces; use Interfaces;

package AVX2 is

   --
   --  Type Definitions
   --

   --  256-bit Vector of 32-bit Signed Integers
   type Vector_256_Integer_32 is array (0 .. 7) of Integer_32;
   for Vector_256_Integer_32'Alignment use 32;
   pragma Machine_Attribute (Vector_256_Integer_32, "vector_type");
   pragma Machine_Attribute (Vector_256_Integer_32, "may_alias");

   --
   --  Function Definitions
   --

   --  Function: 256-bit Vector Addition of 32-bit Signed Integers
   function Vector_256_Integer_32_Add
     (Left, Right : Vector_256_Integer_32) return Vector_256_Integer_32 with
     Convention => Intrinsic, Import => True,
     External_Name => "__builtin_ia32_paddd256";

end AVX2;
```

main.adb

```
with AVX2;        use AVX2;
with Interfaces;  use Interfaces;
with Ada.Text_IO; use Ada.Text_IO;

procedure Main is
   a, b, r : Vector_256_Integer_32;
begin
   for i in Vector_256_Integer_32'Range loop
      a (i) := 5 * (Integer_32 (i) + 5);
      b (i) := 12 * (Integer_32 (i) + 12);
   end loop;

   r := Vector_256_Integer_32_Add (a, b);

   for i in Vector_256_Integer_32'Range loop
      Put_Line
        ("r(i) = a(i) + b(i) = " & a (i)'Image & " + " & b (i)'Image & " = " &
         r (i)'Image);
   end loop;
end Main;
```

Here is an equivalent program in C. Note that this code has only been tested with GCC (where it needs `-mavx2` to compile) and is not necessarily the most efficient.

```
#include <stdio.h>
#include <immintrin.h>
#include <stdint.h>

int main(void)
{
    __m256i ma;
    __m256i mb;
    __m256i mr;
    int32_t a[8] __attribute__((aligned(32)));
    int32_t b[8] __attribute__((aligned(32)));
    int32_t r[8] __attribute__((aligned(32)));

    for (int i = 0; i < 8; ++i) {
        a[i] = 5 * (i + 5);
        b[i] = 12 * (i + 12);
    }

    ma = _mm256_load_si256((const __m256i *)a);
    mb = _mm256_load_si256((const __m256i *)b);
    mr = _mm256_add_epi32(ma, mb);
    _mm256_store_si256((__m256i *)r, mr);

    for (int i = 0; i < 8; ++i) {
        printf("r[i] = a[i] + b[i] = %d + %d = %d\n", a[i], b[i], r[i]);
    }
    return 0;
}
```
