Apart from precision / accuracy / alignment inconsistencies between a GPU, a CPU and a custom OpenCL chip, why do all OpenCL examples on the internet accept host-side buffers directly?
clEnqueueWriteBuffer(cl_command_queue command_queue, ...)
__kernel void vecAdd(__global float * c, ..)
{
    int id = get_global_id(0);
    c[id] = b[id] + a[id];
}
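The usual host-side counterpart of such examples just hands the raw host array to the queue, with no byte-order handling at all; a minimal sketch of that pattern (queue and bufA are placeholder names for objects created earlier):

// the host array's bytes are copied to the device verbatim
float a[1024];                                  // host floats, stored in host byte order
cl_int err = clEnqueueWriteBuffer(queue, bufA,
                                  CL_TRUE, 0,   // blocking write, offset 0
                                  sizeof(a), a, // raw bytes, no endianness conversion
                                  0, NULL, NULL);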
Let's assume the host is little-endian but the device is big-endian, and a, b and c are floats:
Would a and b be loaded as non-normalized garbage?
If that is unknown, would the example functions below always work?
// returns 1 on a little-endian device, 0 on a big-endian device
int endianness_of_device()
{
    unsigned int x = 1;
    return ((((char *)&x)[0]) == 0 ? 0 : 1);
}
// reverse the byte order of a 32-bit int
int reverseBytesInt(int bytes)
{
    void * p = &bytes;
    uchar4 bytesr = ((uchar4 *)p)[0].wzyx;
    int * p2 = (int *)&bytesr;
    return *p2;
}
// reverse the byte order of a 32-bit float
float reverseBytesFloat(float bytes)
{
    void * p = &bytes;
    uchar4 bytesr = ((uchar4 *)p)[0].wzyx;
    float * p2 = (float *)&bytesr;
    return *p2;
}
// round a struct size up to the next multiple of 256 bytes
uint sizeOf2(uint structSize)
{
    uint mult0 = structSize / 256; mult0++;
    return mult0 * 256;
}
typedef struct
{
    uchar end;
    uchar sizeRelByteAdr;
    uchar adrRelByteAdr;
    uchar idRelByteAdr;
    uchar valRelByteAdr;
    uchar woRelByteAdr;
    int size;
    uint adr;
    int id;
    float val;
    float wo;
} nn_p_n;
// bump allocator: reserves 'size' bytes in the heap buffer and returns the absolute address as a uint
uint malloc(__global uchar * heap, __global uint * mallocCtr, int size)
{
    return (atomic_add(mallocCtr, (uint)size) + (uint)heap);
}
to support kernels like:
__kernel void nn(__global uchar * heap, __global uint * mallocCtr)
{
    int id = get_global_id(0);
    if(id == 0)
    {
        nn_p_n * np = (nn_p_n *)malloc(heap, mallocCtr, sizeOf2(sizeof(nn_p_n)));
        np->end = endianness_of_device();
        np->size = sizeOf2(sizeof(nn_p_n));
        np->id = 9;
        np->val = 99.9f;
        // let's simulate different endianness
        np->end = 1 - endianness_of_device();
        np->adr = (uint)np - (uint)heap;
        mem_fence(CLK_GLOBAL_MEM_FENCE);
    }
    if(id == 900)
    {
        // simulating another device reading the buffer
        for(int i = 0; i < 1000000; i++){ int dummy = 0; dummy++; if(dummy > id) dummy++; }
        nn_p_n n = *((nn_p_n *)&heap[0]);
        if(n.end != endianness_of_device())
        {
            n.size = reverseBytesInt(n.size);
            //if(n.size != sizeOf2(sizeof(nn_p_n)))
            //{ return; }
            n.adr = reverseBytesInt(n.adr);
            n.val = reverseBytesFloat(n.val);
        }
        nn_p_n * np = (nn_p_n *)malloc(heap, mallocCtr, sizeOf2(sizeof(nn_p_n)));
        *np = n;
    }
}
This is currently working on my Intel iGPU with an Intel CPU, and on another machine with an AMD CPU and AMD GPU, without any endianness problems. What if I add an Nvidia GPU and an AMD GPU on top of an Intel CPU in the future?
Of course, in cluster computing one needs to handle endianness differences between computers, but what about a virtual OS running inside another OS and using the same device? What if that device is an FPGA with cores of different endianness (is that even possible)?
Last question: can an OS force all devices, even the CPU, to use the same endianness? (I don't think so, but could it be emulated by the OS at a performance cost?)
Edit: It's impossible to preprocess read-only data on the device side. And it is overkill to postprocess on the host side only those elements marked as "written", because just 1-2 elements may have been written while the whole buffer, which could be gigabytes, would have to be downloaded to the host.
Arguments passed to kernels are guaranteed to have the correct endianness (hence all the cl_int etc. typedefs), but this is not the case for buffers. This makes sense, because the contents of buffers are completely opaque to OpenCL: only the user knows how to make sense of what's inside. Hence, it's the user's responsibility to perform any endianness conversion before doing computations (possibly by launching a dedicated kernel).
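For example, a minimal sketch of such a dedicated conversion kernel (assuming the buffer holds 32-bit floats; swapFloatBytes is just an illustrative name) could look like this:

// hypothetical fix-up pass, run once over the buffer right after the transfer
__kernel void swapFloatBytes(__global float * buf)
{
    int id = get_global_id(0);
    uchar4 b = as_uchar4(buf[id]); // reinterpret the float's 4 bytes
    buf[id] = as_float(b.wzyx);    // store them back in reversed order
}

After that pass, every element is in device byte order and the real computation kernels can read the buffer normally.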
In other words:
__kernel void vecAdd(__global float * c, ..)
Here, the value of c is guaranteed to be of the correct endianness (the pointer's bytes themselves are in the correct device order), but the bytes pointed to by c are in whatever order the user set them on the host.
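Concretely, the runtime knows the type and size of every argument set through clSetKernelArg, so the argument values themselves can be converted for the device, while the bytes inside the buffer are passed through untouched (kernel and bufC are illustrative names):

cl_int n = 1024;                                  // scalar argument: the runtime handles its endianness
clSetKernelArg(kernel, 1, sizeof(cl_int), &n);
clSetKernelArg(kernel, 0, sizeof(cl_mem), &bufC); // the cl_mem handle is passed correctly,
                                                  // but the bytes it refers to are left as-is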
why do all OpenCL examples on the internet accept host-side buffers directly?
Most programs are developed for a fairly narrow set of target platforms whose characteristics are known in advance: if they're all little-endian, why bother supporting big-endian? Portability is already a difficult problem, and I suspect that caring about endianness is, in the general case, significant additional complexity for little added value. It simply isn't worth it for the vast majority of programs.
If you consider such a level of portability valuable, then it's your choice to implement it.