Loading XMM registers from address location

I'm trying to load/store memory from/to a char array through a pointer, using the 128-bit XMM0 register on a 32-bit operating system.

What I tried is very simple:

int main() {
    char *data = new char[33];
    for (int i = 0; i < 32; i++)
        data[i] = 'a';
    data[32] = 0;
    __asm
    {
        movdqu xmm0,[data]
    }

    delete[] data;
}

The problem is that this doesn't seem to work. The first time I debugged the Win32 application I got:

xmm0 = 0024F8380000000000F818E30055F158

The second time I debugged it I got:

xmm0 = 0043FD6800000000002C18E3008CF158

So there must be something wrong with the line:

movdqu xmm0,[data]

I tried using this instead:

movdqu xmm0,data

but I got the same result.

My first thought was that I was copying the address instead of the data at that address. However, the value shown in the xmm0 register is too large for a 32-bit address, so it must be copying memory from some other address.

I also tried some other instructions I found on the Internet, but with the same result.

Is it the way I'm passing the pointer or am I misunderstanding something about xmm basics?

A valid solution with an explanation will be appreciated.

Even though I found the solution (finally after three hours), I would still like an explanation:

__asm
    {
        push eax
        mov eax,data
        movdqu xmm0,[eax]
        pop eax
    }

Why do I have to load the pointer into a 32-bit register first?

Solution

#include <cstdio>   // printf

int main()
{
    char *dataptr = new char[33];
    char datalocal[33];
    dataptr[0] = 'a';   dataptr[1] = 0;
    datalocal[0] = 'a'; datalocal[1] = 0;
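    // %p expects void *, so cast each pointer explicitly.
    printf("%p %p %c\n", static_cast<void *>(dataptr), static_cast<void *>(&dataptr), dataptr[0]);
    printf("%p %p %c\n", static_cast<void *>(datalocal), static_cast<void *>(&datalocal), datalocal[0]);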
    delete[] dataptr;
}

Output:

0xd38050 0x7635bd709448 a
0x7635bd709450 0x7635bd709450 a

As we can see, the dynamic pointer dataptr is really a pointer variable (32 or 64 bits wide, here stored at 0x7635BD709448) that contains a pointer to the heap, 0xD38050.

The local variable is itself a 33-character buffer, allocated at address 0x7635BD709450.

But datalocal also works as a char * value.

I'm a bit confused about what the formal C++ explanation of this is. While writing C++ code, this feels quite natural: dataptr[0] is the first element in the heap memory (at machine level that takes two memory reads, first the pointer variable and then the heap). In assembly, however, you see the true nature of the data symbol: it names the address of the pointer variable. So you first have to load the heap pointer with mov eax,[data] (in MSVC inline assembly mov eax,data means the same thing), which loads eax with 0xD38050, and then you can load the contents at 0xD38050 into XMM0 by using [eax].

With a local array there is no separate variable holding its address; the symbol datalocal is already the address of the first element, so movdqu xmm0,[data] would work if data were declared that way.

In the "wrong" case you can still execute movdqu xmm0,[data]; the CPU has no problem loading 128 bits starting at a 32-bit variable. It simply keeps reading past those 32 bits and picks up another 96 bits belonging to whatever lies next to the variable (other variables, saved registers, and so on). If the variable happens to sit at the very end of the last mapped memory page, the load crashes with an invalid access.

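Alignment was mentioned a few times in the comments. That's a valid point, with one caveat: movdqu is the unaligned variant and accepts any address, whereas movdqa faults unless the address is 16-byte aligned. Aligned data is still preferable, because unaligned loads can be slower, especially on older CPUs or when they cross a cache line. Check what your C++ compiler offers for alignment. For Visual Studio this should work: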

__declspec(align(16)) char datalocal[33];
char *dataptr = static_cast<char *>(_aligned_malloc(33, 16));  // C++ needs the cast from void *
_aligned_free(dataptr);

About my C++ interpretation: maybe I had this wrong from the beginning.

In C++, the expression dataptr is the value of the dataptr variable, that is, the heap address. Then dataptr[0] dereferences the heap address, accessing the first element of the allocated memory. &dataptr is the address where that value is stored. This also makes sense with syntax like dataptr = nullptr;, where you are storing the nullptr value into the dataptr variable, not overwriting the address of the dataptr symbol.

With datalocal[] there's basically no sense in assigning to plain datalocal, as in datalocal = 'a';, because it's an array variable, so you always provide the [] index. And &datalocal is the address of that array. Plain datalocal is then a shortcut for easier pointer math with arrays: its type is really char[33], but it converts (decays) to a char * pointing at the first element. Even if plain datalocal were a syntax error, it would still be possible to write C++ code (using &datalocal for the pointer, datalocal[..] for elements), and it would fit that dataptr logic completely.

Conclusion: Your example was wrong from the start, because in assembly language [data] loads the value of data, which is the pointer to the heap returned by new, rather than the characters it points to.

This is my own explanation, and now some C++ expert will come and tear it to pieces from a formal point of view... :)))