Using a union (encapsulated in a struct) to bypass conversions for NEON data types

My first experience with vectorization intrinsics was with SSE, where there is basically only one data type, __m128i. Switching to NEON, I found the data types and function prototypes to be much more specific, e.g. uint8x16_t (a vector of 16 unsigned char), uint8x8x2_t (2 vectors of 8 unsigned char each), uint32x4_t (a vector of 4 uint32_t), etc.

At first I was enthusiastic (it is much easier to find the exact function operating on the desired data type), but then I saw what a mess it becomes when you want to treat the same data in different ways. Using the specific casting operators (e.g. the vreinterpret intrinsics) everywhere would take me forever. The problem is also addressed here. I then came up with the idea of a union encapsulated in a struct, plus some conversion and assignment operators.

struct uint_128bit_t {
    union {
        uint8x16_t uint8x16;
        uint16x8_t uint16x8;
        uint32x4_t uint32x4;
        uint8x8x2_t uint8x8x2;
        uint8_t uint8_array[16] __attribute__ ((aligned (16) ));
        uint16_t uint16_array[8] __attribute__ ((aligned (16) ));
        uint32_t uint32_array[4] __attribute__ ((aligned (16) ));
    };

    operator uint8x16_t& () {return uint8x16;}
    operator uint16x8_t& () {return uint16x8;}
    operator uint32x4_t& () {return uint32x4;}
    operator uint8x8x2_t& () {return uint8x8x2;}
    uint8x16_t& operator =(const uint8x16_t& in) {uint8x16 = in; return uint8x16;}
    uint8x8x2_t& operator =(const uint8x8x2_t& in) {uint8x8x2 = in; return uint8x8x2;}

};

This approach works for me: I can use a variable of type uint_128bit_t as both argument and result with different NEON intrinsics, e.g. vshlq_n_u32, vuzp_u8, vget_low_u8 (in this last case only as input). I can also extend it with more data types if I need to. Note: the arrays are only there to make it easy to print the content of a variable.

Is this a correct way of proceeding?
Is there any hidden flaw?
Have I reinvented the wheel?
(Is the aligned attribute necessary?)

Solution

Since the initially proposed method has undefined behaviour in C++ (it reads a union member other than the one last written), I have implemented something like this:

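#include <cstring>                  // memcpy
#include <boost/static_assert.hpp>  // BOOST_STATIC_ASSERT_MSG
#include <arm_neon.h>               // NEON vector types

template <typename T>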
struct NeonVectorType {

    private:
    T data;

    public:
    template <typename U>
    operator U () const {   // const: the conversion only reads data
        BOOST_STATIC_ASSERT_MSG(sizeof(U) == sizeof(T),"Trying to convert to data type of different size");
        U u;
        memcpy( &u, &data, sizeof u );
        return u;
    }

    template <typename U>
    NeonVectorType<T>& operator =(const U& in) {
        BOOST_STATIC_ASSERT_MSG(sizeof(U) == sizeof(T),"Trying to copy from data type of different size");
        memcpy( &data, &in, sizeof data );
        return *this;
    }

};

Then:

typedef NeonVectorType<uint8x16_t> uint_128bit_t; //suitable for uint8x16_t, uint8x8x2_t, uint32x4_t, etc.
typedef NeonVectorType<uint8x8_t> uint_64bit_t; //suitable for uint8x8_t, uint32x2_t, etc.
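
To show how the wrapper is meant to be used, here is a minimal sketch. It is not the exact code above: it is a condensed copy that uses C++11 static_assert instead of BOOST_STATIC_ASSERT_MSG, and the types u8x16 and u32x4 are hypothetical stand-ins for uint8x16_t and uint32x4_t so that it builds without <arm_neon.h>. The helper reinterpret_bytes_as_words is also just an illustrative name.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical stand-ins for uint8x16_t / uint32x4_t so the sketch builds
// without <arm_neon.h>; on an ARM target the real NEON types would be used.
struct u8x16 { std::uint8_t  v[16]; };
struct u32x4 { std::uint32_t v[4];  };

// Condensed copy of the wrapper, with C++11 static_assert instead of Boost.
template <typename T>
struct NeonVectorType {
    template <typename U>
    operator U() const {
        static_assert(sizeof(U) == sizeof(T), "size mismatch on conversion");
        U u;
        std::memcpy(&u, &data, sizeof u);
        return u;
    }

    template <typename U>
    NeonVectorType& operator=(const U& in) {
        static_assert(sizeof(U) == sizeof(T), "size mismatch on assignment");
        std::memcpy(&data, &in, sizeof data);
        return *this;
    }

private:
    T data;
};

// Stores sixteen copies of `byte`, then reads the same bits back as four
// 32-bit lanes. Every lane is the byte repeated four times, so the result
// does not depend on endianness.
inline u32x4 reinterpret_bytes_as_words(std::uint8_t byte) {
    u8x16 bytes;
    std::memset(bytes.v, byte, sizeof bytes.v);

    NeonVectorType<u8x16> wrapped;
    wrapped = bytes;        // templated operator=, U = u8x16
    u32x4 words = wrapped;  // templated conversion operator, U = u32x4
    return words;
}
```

With the real NEON types the same pattern would let one variable be passed to, say, a u8 intrinsic and a u32 intrinsic without writing an explicit cast at each call.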

The use of memcpy is discussed here (and here); it avoids breaking the strict aliasing rule. Note that compilers generally optimize such a small, fixed-size memcpy away.

If you look at the edit history, you will see that I had implemented a custom version with combine operators for vectors of vectors (e.g. uint8x8x2_t). The problem was mentioned here. However, since those data types are declared as arrays (see guide, section 12.2.2) and are therefore located in consecutive memory locations, the compiler is bound to treat the memcpy correctly.
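
The layout argument can be sketched as follows. The types here are hypothetical stand-ins that only mimic the shape of the NEON ones (a struct holding an array of two 64-bit vectors, like uint8x8x2_t with its val[2] member), so this is an illustration under that assumption, not a check against the real <arm_neon.h> definitions. split_halves is an illustrative name.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical stand-ins: a 64-bit vector, a pair of them stored as an
// array (modelled on uint8x8x2_t), and a 128-bit vector.
struct u8x8   { std::uint8_t v[8]; };
struct u8x8x2 { u8x8 val[2]; };
struct u8x16  { std::uint8_t v[16]; };

// The memcpy-based conversion relies on both types having the same size.
static_assert(sizeof(u8x8x2) == sizeof(u8x16),
              "pair of 64-bit vectors must occupy exactly 128 bits");

// Copies the 16 bytes of `whole` into the two consecutive halves.
// val[0] receives bytes 0..7 and val[1] receives bytes 8..15.
inline u8x8x2 split_halves(const u8x16& whole) {
    u8x8x2 halves;
    std::memcpy(&halves, &whole, sizeof halves);
    return halves;
}
```

Note that this is a plain reinterpretation of memory: the halves are the low and high 8 bytes, not a de-interleaved result like the one vuzp_u8 produces.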

Finally, to print the content of the variable one could use a function like this.
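
The linked function is not reproduced here. As a rough sketch of the idea, and staying with the memcpy approach, something like the following would do: it copies the bytes of any trivially copyable value into a byte array and formats them as hex. The name lanes_to_hex is made up for this example, and the byte order shown is simply the order in memory.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <string>

// Returns the bytes of `vec`, in memory order, as space-separated
// two-digit hex values. V is assumed to be trivially copyable, which
// holds for plain vector-like types.
template <typename V>
std::string lanes_to_hex(const V& vec) {
    std::uint8_t bytes[sizeof(V)];
    std::memcpy(bytes, &vec, sizeof bytes);  // no aliasing issues

    std::string out;
    char buf[4];
    for (std::size_t i = 0; i < sizeof bytes; ++i) {
        std::snprintf(buf, sizeof buf, "%02x", static_cast<unsigned>(bytes[i]));
        if (i != 0) out += ' ';
        out += buf;
    }
    return out;
}
```

Printing is then just a matter of passing the converted vector to the helper and writing the returned string to the output stream of your choice.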