Is ntohl in x86 assembly necessary?

My goal is to send an integer value over TCP as little-endian bytes to a client on Windows. The client is written in x86 assembly.

If I were to send htonl()-encoded bytes to the client, would decoding them be necessary, given that the client is x86? For instance, wouldn't it be redundant to call ntohl() within my client's assembly code?

My overarching question is: do I need to call htonl() server-side and ntohl() client-side (x86 Windows client)? Or should I just let the server do the work, checking whether the server's architecture is big-endian and, if so, swapping the integer's bytes via __builtin_bswap32() and sending the little-endian bytes to the client? I'm asking because I've read that x86 is always little-endian, so the conversion seems redundant if I know the client will always be written in x86 assembly.

Solution

My overarching question is: do I need to call htonl() server-side and ntohl() client-side (x86 Windows client)?

No, that would convert to big-endian ("network" byte order), but you said you wanted to send data over the network in little-endian format. On x86, that already is the h ("host") order.

In x86 asm, your data in memory will already be little-endian integers / floats unless you did something unusual (like using bswap, movbe, or pshufb, or a byte-at-a-time shift / store).
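One way to see this from C is to copy an integer's object representation into a byte array and look at which byte comes first. A small sketch (host_is_little_endian is a name invented for this example, not a library function):

```c
#include <stdint.h>
#include <string.h>

/* Returns 1 if the host stores the least significant byte of a
   uint32_t at the lowest address (little-endian), 0 otherwise. */
static inline int host_is_little_endian(void)
{
    const uint32_t probe = 0x01020304u;
    unsigned char bytes[sizeof probe];

    /* memcpy copies the in-memory representation without reinterpreting it. */
    memcpy(bytes, &probe, sizeof probe);
    return bytes[0] == 0x04;
}
```

On any x86 build this should return 1, which is why the client can use received little-endian bytes directly.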

To be compatible with that in C, use le32toh (on receive) and htole32 (before send) from glibc's / BSD <endian.h> (<sys/endian.h> on some BSDs) instead of ntohl / htonl. I.e., use LE as your network format instead of the traditional BE.
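If <endian.h> isn't available on the server (it is not part of standard C), a byte-at-a-time version produces the same little-endian wire format on any host, with no endianness check needed. A minimal sketch; put_le32 and get_le32 are names made up here, not library functions:

```c
#include <stdint.h>

/* Store v into buf[0..3] least significant byte first (little-endian),
   regardless of the host's own byte order. */
static inline void put_le32(unsigned char *buf, uint32_t v)
{
    buf[0] = (unsigned char)(v & 0xFF);
    buf[1] = (unsigned char)((v >> 8) & 0xFF);
    buf[2] = (unsigned char)((v >> 16) & 0xFF);
    buf[3] = (unsigned char)((v >> 24) & 0xFF);
}

/* Rebuild a uint32_t from 4 little-endian bytes read from the wire. */
static inline uint32_t get_le32(const unsigned char *buf)
{
    return (uint32_t)buf[0]
         | ((uint32_t)buf[1] << 8)
         | ((uint32_t)buf[2] << 16)
         | ((uint32_t)buf[3] << 24);
}
```

The server would fill a 4-byte buffer with put_le32 and send those bytes; the x86 client can then load them with a plain 32-bit mov. Optimizing compilers usually recognize this shift pattern and turn it into a single load or store where the host order already matches.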

call ntohl() within my client's assembly code?

That would be insane. If you did want that, just use the bswap or movbe instructions instead of actually setting up args for a function call. Normally those functions inline when you use them in C, although there is a stand-alone definition of ntohl in libc.
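For reference, on a little-endian host ntohl amounts to a 32-bit byte reversal, which is what a single bswap instruction does. A portable C sketch of that operation (swap32 is a hypothetical helper name; GCC and clang also provide __builtin_bswap32, as mentioned in the question):

```c
#include <stdint.h>

/* Reverse the byte order of a 32-bit value: 0xAABBCCDD -> 0xDDCCBBAA. */
static inline uint32_t swap32(uint32_t x)
{
    return ((x & 0x000000FFu) << 24)
         | ((x & 0x0000FF00u) << 8)
         | ((x & 0x00FF0000u) >> 8)
         | ((x & 0xFF000000u) >> 24);
}
```

With a little-endian wire format the x86 client never needs this; only a big-endian server would have to swap before sending.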

Also, no, you wouldn't want to do that. Your client doesn't want to have anything to do with big-endian, which is what those traditional functions call "network" byte order.

x86 asm with AVX2 vpshufb can byte-swap at memcpy speed (including on small buffers that fit in L1d cache), but it's even more efficient not to have to swap at all as part of the first step that reads the data.