fast padded strcpy for a single word

I'm trying to write a very cheap C++ code snippet to do the following operation on a short null terminated string.

The input is a string like "ABC". It is null terminated and has maximum length of 4 (or 5 with the null terminator).

The output goes to a char[4] which is not null terminated and should be space padded on the right. So in this case it would be {'A','B','C',' '}

It is ok to assume that the input string is properly null terminated, so there's no need to read a second word of the input to make sure. 4 bytes is the longest it can be.

So the code around it looks like this:

char* input = "AB";
char output[4];
// code snippet goes here
// afterward output will be populated with {'A','B',' ',' '}

How cheaply can this be done? If it matters: I'm working with:

Linux 2.6.32-358.11.1.el6.x86_64 #1 SMP x86_64 x86_64 x86_64 GNU/Linux

Lastly, the input is word aligned.

Solution

How about something like this:

typedef unsigned int word;
int spacePad(word input) {
    static const word spaces = 0x20202020;

    word mask =
       !input ?                0 :
       !(input & 0x00ffffff) ? 0xff:
       !(input & 0x0000ffff) ? 0xffff :
       !(input & 0x0000ff)   ? 0xffffff :
                               0xffffffff;
    // or without branches
    word branchless_mask =
       1u << (8 * (
         bool(input & 0xff000000) +
         bool(input & 0x00ff0000) +
         bool(input & 0x0000ff00) +
         bool(input & 0x000000ff)
       ));

    return (spaces & mask) | (input & ~mask);
}

And if I didn't screw up, spacePad(0xaabb0000) is 0xaabb2020.

Instead of computing and-masks, you could use SSE intrinsics which would probably be faster since you'd get the mask in a couple of instruction, and then masked move would do the rest, but the compiler would probably move your variables arround from SSE to standard registers which could outweight the slight gain. It all depends on how much data you need to process, how it's packed in memory, etc.

If the input in a char* and not an int, normally additionnal code would be necessary since a cast could read into unallocated memory. But since you mention all strings are word-aligned a cast is enough, indeed even if there are a few unallocated bytes, they are on the same word as at least one allocated byte. Since you are only reading there's no risk of memory corruption and on all architectures I know of, hardware memory protection has a granularity larger than a word. For instance on x86 a memory page is often 4k aligned.

Now that's all nice and hacky, but: before selecting a solution, benchmark it, that's the only way to know which is best for you (except of course the warm fuzzy feeling of writing code like this ^^)