
Is it possible to cast a string into its integer/long representation in C?


Upon decompiling various programs (which I do not have the source for), I have found some interesting sequences of code. A program has a c-string (str) defined in the DATA section. In some function in the TEXT section, a part of that string is set by moving a hexadecimal number to a position in the string (simplified Intel assembly MOV str,0x006f6c6c6568). Here is a snippet in C:

#include <stdio.h>

static char str[16];

int main(void)
{
    *(long *)str = 0x006f6c6c6568;
    printf("%s\n", str);
    return 0;
}

I am running macOS, which uses little endian, so 0x006f6c6c6568 translates to hello. The program compiles with no errors or warnings, and when run, prints out hello as expected. I calculated 0x006f6c6c6568 by hand, but I was wondering if C could do it for me. Something like this is what I mean:

#include <stdio.h>

static char str[16];

int main(void)
{
    // *(long *)str = 0x006f6c6c6568;
    *(str+0) = "hello";
    printf("%s\n", str);
    return 0;
}

Now, I would like "hello" to be treated not as a string literal but as an integer. For little-endian, it might be expanded like this:

    *(long *)str = (long)(((long)'h') |
                          ((long)'e' << 8) |
                          ((long)'l' << 16) |
                          ((long)'l' << 24) |
                          ((long)'o' << 32) |
                          ((long)0 << 40));

Or, if compiled for a big-endian target, this:

    *(long *)str = (long)(((long) 0  << 16) |
                          ((long)'o' << 24) |
                          ((long)'l' << 32) |
                          ((long)'l' << 40) |
                          ((long)'e' << 48) |
                          ((long)'h' << 56));

Thoughts?


Solution

  • TL;DR: you want strncpy into a uint64_t. This answer is long in an attempt to explain the concepts and how to think about memory from C vs. asm perspectives, and whole integers vs. individual chars / bytes. (i.e. if it's obvious that strlen/memcpy or strncpy would do what you want, just skip to the code.)


    If you want to copy exactly 8 bytes of string data into an integer, use memcpy. The object-representation of the integer will then be those string bytes.

    Strings always have the first char at the lowest address: a string is a sequence of char elements, so endianness isn't a factor, because there's no addressing within a char. Integers are different: which end holds the least-significant byte is endian-dependent.

    Storing this integer into memory will have the same byte order as the original string, just like if you'd done memcpy to a char tmp[8] array instead of a uint64_t tmp. (C itself doesn't have any notion of memory vs. register; every object has an address, unless the as-if rule lets the optimizer keep it purely in registers. But assigning to some array elements can get a real compiler to use store instructions instead of just putting the constant in a register, so you could then look at those bytes with a debugger and see they were in the right order. Or pass a pointer to fwrite or puts or whatever.)

    memcpy avoids possible undefined behaviour from alignment and strict-aliasing violations from *(uint64_t*)str = val;. i.e. memcpy(str, &val, sizeof(val)) is a safe way to express an unaligned strict-aliasing safe 8-byte load or store in C, like you could do easily with mov in x86-64 asm.
    (GNU C also lets you typedef uint64_t aliasing_u64 __attribute__((aligned(1), may_alias)); - you can point that at anything and read/write through it safely, just like with an 8-byte memcpy.)
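
    As a minimal sketch of that memcpy idiom (the helper names store8 and load8_unaligned are made up for illustration, not from any library), an unaligned 8-byte store and its matching load could look like this:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: write the 8 bytes of val's object representation
   to dst. dst may be unaligned; memcpy has no alignment requirement. */
void store8(char *dst, uint64_t val)
{
    assert(dst != NULL);
    memcpy(dst, &val, sizeof(val));
}

/* Hypothetical helper: the matching load, reading 8 bytes from src. */
uint64_t load8_unaligned(const char *src)
{
    uint64_t val;
    assert(src != NULL);
    memcpy(&val, src, sizeof(val));
    return val;
}
```

    On a little-endian host, store8(str, 0x006f6c6c6568) writes the bytes of "hello" followed by three zero bytes, which is what the cast in the question was trying to do. A load followed by a store reproduces the original bytes on any host, whatever its endianness.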

    char* and unsigned char* can alias any other type in ISO C, so it's safe to use memcpy and even strncpy to write the object-representation of other types, especially ones that have a guaranteed format / layout like uint64_t (fixed width, no padding, if it exists at all).


    If you want shorter strings to zero-pad out to the full size of an integer, use strncpy. On little-endian machines it's like an integer of width CHAR_BIT * strlen() being zero-extended to 64-bit, since the extra zero bytes after the string go into the bytes that represent the most-significant bits of the integer.

    On big-endian machines, the low bits of the value will be zeros, as if you left-shifted that "narrow integer" to the top of the wider integer. (And the non-zero bytes are in a different order wrt. each other).
    On a mixed-endian machine (e.g. PDP-11), it's less simple to describe.
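
    To make the two cases concrete without needing both kinds of hardware, here is a sketch that computes each interpretation with shifts, independent of the host's byte order. The names le_view and be_view are hypothetical, and the code assumes CHAR_BIT == 8:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Value a little-endian host would see: first char in the lowest byte,
   so short strings look zero-extended. Assumes CHAR_BIT == 8. */
uint64_t le_view(const char *str)
{
    uint64_t value = 0;
    assert(str != NULL);
    for (size_t i = 0; i < sizeof(value) && str[i] != '\0'; i++)
        value |= (uint64_t)(unsigned char)str[i] << (8 * i);
    return value;
}

/* Value a big-endian host would see: first char in the highest byte,
   so short strings look left-shifted to the top. Assumes CHAR_BIT == 8. */
uint64_t be_view(const char *str)
{
    uint64_t value = 0;
    assert(str != NULL);
    for (size_t i = 0; i < sizeof(value) && str[i] != '\0'; i++)
        value |= (uint64_t)(unsigned char)str[i] << (8 * (sizeof(value) - 1 - i));
    return value;
}
```

    For example, le_view("hi") is 0x6968 while be_view("hi") is 0x6869000000000000. Unlike the memcpy / strncpy versions, these return a fixed value regardless of the machine they run on, so they only match the in-memory bytes on a host of the corresponding endianness.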

    strncpy is bad for actual strings but exactly what we want here. It's inefficient for normal string-copying because it always writes out to the specified length (wasting time and touching otherwise unused parts of a long buffer for short copies). And it's not very useful for safety with strings because it doesn't leave room for a terminating zero with large source strings.
    But both of those things are exactly what we want/need here: it behaves like memcpy(&val, str, 8) for strings of length 8 or higher, but for shorter strings it doesn't leave garbage in the upper bytes of the integer.
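
    Both behaviours are easy to observe on a plain byte buffer. This is only an illustrative sketch; fill8 is a made-up helper, and the buffer is poisoned with 0xFF first so that any byte strncpy left untouched would be visible:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: fill an 8-byte buffer from str the way strncpy does. */
void fill8(unsigned char out[8], const char *str)
{
    assert(out != NULL && str != NULL);
    memset(out, 0xFF, 8);          /* poison: untouched bytes would stay 0xFF */
    strncpy((char *)out, str, 8);  /* copies up to 8 chars, zero-fills the rest */
}
```

    After fill8(buf, "hi"), the buffer holds 'h', 'i' and six zero bytes. After fill8(buf, "hello world!"), it holds "hello wo" with no terminating zero. GCC may warn about the truncation when it can see the string's length.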

    Example: first 8 bytes of a string

    #include <string.h>
    #include <stdint.h>
    
    uint64_t load8(const char* str)
    {
        uint64_t value;
        memcpy(&value, str, sizeof(value));     // load exactly 8 bytes
        return value;
    }
    
    uint64_t test2(){
        return load8("hello world!");  // constant-propagation through it
    }
    

    This compiles very simply, to one x86-64 8-byte mov instruction using GCC or clang on the Godbolt compiler explorer.

    load8:
            mov     rax, QWORD PTR [rdi]
            ret
    
    test2:
            movabs  rax, 8031924123371070824  # 0x6F77206F6C6C6568 
              # little-endian "hello wo", note the 0x20 ' ' byte near the top of the value
            ret
    

    On ISAs where unaligned loads just work with at worst a speed penalty, e.g. x86-64 and PowerPC64, memcpy reliably inlines. But on MIPS64 you'd get a function call.

    # PowerPC64 clang(trunk) -O3
    load8:
            ld 3, 0(3)            # r3 = *r3   first arg and return-value register
            blr
    

    BTW, I used sizeof(value) instead of 8 for two reasons: first so you can change the type without having to manually change a hard-coded size.

    Second, because a few obscure C implementations (like modern DSPs with word-addressable memory) don't have CHAR_BIT == 8. Often 16 or 24, with sizeof(int) == 1 i.e. the same as a char. I'm not sure exactly how the bytes would be arranged in a string literal, like whether you'd have one character per char word or if you'd just have an 8-letter string in fewer than 8 chars, but at least you wouldn't have undefined behaviour from writing outside a local variable.

    Example: short strings with strncpy

    // Take the first 8 bytes of the string, zero-padding if shorter
    // (on a big-endian machine, that left-shifts the value, rather than zero-extending)
    uint64_t stringbytes(const char* str)
    {
        // if (!str)  return 0;   // optional NULL-pointer check
        uint64_t value;           // strncpy always writes the full size (with zero padding if needed)
        strncpy((char*)&value, str, sizeof(value)); // load up to 8 bytes, zero-extending for short strings
        return value;
    }
    
    uint64_t tests1(){
        return stringbytes("hello world!");
    }
    uint64_t tests2(){
        return stringbytes("hi");
    }
    
    tests1():
            movabs  rax, 8031924123371070824     # same as with memcpy
            ret
    tests2():
            mov     eax, 26984        # 0x6968 = little-endian "hi"
            ret
    

    The strncpy misfeatures (that make it not good for what people wish it was designed for, a strcpy that truncates to a limit) are why compilers like GCC warn about these valid use-cases with -Wall. That and our non-standard use-case, where we want truncation of a longer string literal just to demo how it would work. That's not strncpy's fault, but the warning about passing a length limit the same as the actual size of the destination is.

    In function 'constexpr uint64_t stringbytes2(const char*)',
        inlined from 'constexpr uint64_t tests1()' at <source>:26:24:
    <source>:20:12: warning: 'char* strncpy(char*, const char*, size_t)' output truncated copying 8 bytes from a string of length 12 [-Wstringop-truncation]
       20 |     strncpy(u.c, str, 8);
          |     ~~~~~~~^~~~~~~~~~~~~
    <source>: In function 'uint64_t stringbytes(const char*)':
    <source>:10:12: warning: 'char* strncpy(char*, const char*, size_t)' specified bound 8 equals destination size [-Wstringop-truncation]
       10 |     strncpy((char*)&value, str, sizeof(value)); // load up to 8 bytes, zero-extending for short strings
          |     ~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    

    Big-endian examples: PowerPC64

    Strangely, GCC for MIPS64 doesn't want to inline strnlen, and PowerPC can more efficiently construct constants larger than 32-bit anyway. (Fewer shift instructions, as oris can OR into bits [31:16], i.e. OR a shifted immediate.)

    uint64_t foo = tests1();
    uint64_t bar = tests2();
    

    Compiling as C++ to allow function return values as initializers for global vars, clang (trunk) for PowerPC64 compiles the above with constant-propagation into initialized static storage in .data for these global vars, instead of calling a "constructor" at startup to store into the BSS like GCC unfortunately does. (It's weird because GCC's initializer function just constructs the value from immediates itself and stores.)

    foo:
            .quad   7522537965568948079             # 0x68656c6c6f20776f
                                          # big-endian "h e l l o   w o"
    
    bar:
            .quad   7523544652499124224             # 0x6869000000000000
                                          # big-endian "h i \0\0\0\0\0\0"
    

    The asm for tests1() can only construct a constant from immediates 16 bits at a time (because an instruction is only 32 bits wide, and some of that space is needed for opcodes and register numbers). Godbolt

    # GCC11 for PowerPC64 (big-endian mode, not power64le)  -O3 -mregnames 
    tests2:
            lis %r3,0x6869    # Load-Immediate Shifted, i.e. big-endian "hi"<<16
            sldi %r3,%r3,32   # Shift Left Doubleword Immediate  r3<<=32 to put it all the way to the top of the 64-bit register
              # the return-value register holds 0x6869000000000000
            blr               # return
    
    tests1():
            lis %r3,0x6865        # big-endian "he"<<16
            ori %r3,%r3,0x6c6c    # OR Immediate producing "hell"
            sldi %r3,%r3,32       # r3 <<= 32
            oris %r3,%r3,0x6f20   # r3 |=  "o " << 16
            ori %r3,%r3,0x776f    # r3 |=  "wo"
              # the return-value register holds 0x68656c6c6f20776f
            blr
    

    I played around a bit with getting constant-propagation to work for an initializer for a uint64_t foo = tests1() at global scope in C++ (C doesn't allow non-const initializers in the first place) to see if I could get GCC to do what clang does. No success so far. And even with constexpr and C++20 std::bit_cast<uint64_t>(struct_of_char_array) I couldn't get g++ or clang++ to accept uint64_t foo[stringbytes2("h")] to use the integer value in a context where the language actually requires a constexpr, rather than it just being an optimization. Godbolt.

    IIRC std::bit_cast should be able to manufacture a constexpr integer out of a string literal but there might have been some trick I'm forgetting; I didn't search for existing SO answers yet. I seem to recall seeing one where bit_cast was relevant for some kind of constexpr type-punning.


    Credit to @selbie for the strncpy idea and the starting point for the code; for some reason they changed their answer to be more complex and avoid strncpy, so it's probably slower when constant-propagation doesn't happen, assuming a good library implementation of strncpy that uses hand-written asm. But either way still inlines and optimizes away with a string literal.

    Their current answer with strnlen and memcpy into a zero-initialized value is exactly equivalent to this in terms of correctness, but compiles less efficiently for runtime-variable strings.
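
    For comparison, the length-then-memcpy approach could be sketched roughly as follows. This is not @selbie's exact code: the function name is hypothetical, and memchr stands in for strnlen (which is POSIX rather than ISO C), relying on C11's rule that memchr behaves as if it reads sequentially and stops at the first match:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical alternative to the strncpy version: find the bounded
   length first, then copy only that many bytes into a zeroed integer. */
uint64_t stringbytes_memcpy(const char *str)
{
    uint64_t value = 0;                     /* padding bytes start out zero */
    assert(str != NULL);
    const char *nul = memchr(str, '\0', sizeof(value));   /* bounded search */
    size_t len = nul ? (size_t)(nul - str) : sizeof(value);
    memcpy(&value, str, len);               /* copy just the string bytes */
    return value;
}
```

    It should give the same result as the strncpy version for any input string, including the empty string (which yields 0).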