Tags: string, assembly, null, x86, masm

What is the difference between a null-terminated string and a string that is not null-terminated in x86 assembly language?


I'm currently learning assembly programming by following Kip Irvine's book "Assembly Language for x86 Processors".

In the book, the author states:

The most common type of string ends with a null byte (containing 0). Called a null-terminated string

In the next section of the book, the author gives a string example without the null byte:

greeting1 \
BYTE "Welcome to the Encryption Demo program "

So I was wondering: what is the difference between a null-terminated string and a string that is not null-terminated in x86 assembly language? Are they interchangeable? Or are they not equivalent?


Solution

  • There's nothing specific to asm here; it's the same issue in C. It's all about how you store strings in memory and keep track of where they end.

    What is the difference between a null-terminated string and a string that is not terminated by null?

    A null-terminated string has a 0 byte (ASCII NUL) after it, so you can find the end with strlen (e.g. with a slow repne scasb). This is an implicit-length, 0-terminated string, like C uses. A string without a terminator needs its length tracked separately (an explicit-length string).
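
    As a minimal sketch of what "finding the end" means, here is the scan written out in C (the answer notes the issue is the same there). The name my_strlen is hypothetical, chosen only for this illustration; it does the same job as the standard strlen.

```c
#include <stddef.h>

/* Hypothetical helper (not from the original answer): find the length of a
   0-terminated string by scanning for the terminator, as strlen does. */
size_t my_strlen(const char *s)
{
    const char *p = s;
    while (*p != '\0')          /* stop at the first 0 byte */
        p++;
    return (size_t)(p - s);     /* the terminator itself is not counted */
}
```

    Note that the scan stops at the first 0 byte it meets, wherever that is. If the terminator is missing, it keeps reading whatever memory follows the string.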

    The related question 'NASM Assembly - what is the ", 0" after this variable for?' explains the NASM syntax for creating one in static storage with db (in MASM, the equivalent is BYTE "text",0). The question 'db usage in nasm, try to store and print string' shows what happens when you forget the 0 terminator.

    Are they interchangeable?

    If you know the length of a null-terminated string, you can pass pointer+length to a function that wants an explicit-length string. That function will never look at the 0 byte, because you will pass a length that doesn't include the 0 byte. It's not part of the string data proper.
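
    A minimal sketch of that direction in C, assuming a hypothetical explicit-length comparison function buf_equal (not from any library). A 0-terminated string and an unterminated buffer holding the same bytes compare equal, as long as the length passed for the terminated one excludes its terminator.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical explicit-length consumer: compares two byte ranges.
   It reads exactly alen and blen bytes and never looks past them. */
int buf_equal(const char *a, size_t alen, const char *b, size_t blen)
{
    return alen == blen && memcmp(a, b, alen) == 0;
}
```

    For a C string s, the call would be buf_equal(s, strlen(s), other, other_len). Since strlen does not count the terminator, the callee never sees the 0 byte.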

    But if you have a string without a terminator, you can't pass it to a function or system call that wants a null-terminated string. (If the memory is writable and the byte just past the end is free to use, you could store a 0 after the string to turn it into a null-terminated string. But that only works if the string data itself contains no 0 bytes. UTF-16 text, for example, has a 0 byte in every ASCII-range character, so it would be cut short at the first one. Explicit-length strings can contain any byte values; implicit-length strings can't contain their terminator.)
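
    The conversion in this direction can be sketched in C as a copy into a buffer with one spare byte. The helper name make_cstr and its return convention are assumptions made for this example, not an existing API.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy len bytes from src into dst and append a
   0 terminator. Returns 0 on success, or -1 if dst is too small or if
   src contains an embedded 0 byte (the result would look shorter than
   len to any function that scans for the terminator). */
int make_cstr(char *dst, size_t dst_size, const char *src, size_t len)
{
    if (len >= dst_size)
        return -1;                      /* no room for the terminator */
    if (memchr(src, '\0', len) != NULL)
        return -1;                      /* embedded 0 would truncate it */
    memcpy(dst, src, len);
    dst[len] = '\0';
    return 0;
}
```

    The embedded-zero check is a design choice for this sketch; a caller that is happy with truncation at the first 0 byte could skip it.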


    In Linux, many system calls take strings as C-style implicit-length null-terminated strings, usually for paths (i.e. just a const char* with no separate integer length).

    For example, open(2) takes a string for the path: int open(const char *pathname, int flags); You must pass a null-terminated string to the system call. It's impossible to create a file with a name that includes a '\0' in Linux (same as most other Unix systems), because all the system calls for dealing with files use null-terminated strings.

    OTOH, write(2) takes a memory buffer which isn't necessarily a string. It has the signature ssize_t write(int fd, const void *buf, size_t count); It doesn't care if there are zeroes anywhere in or after the buffer: it just copies the bytes from buf through buf+count-1, without even looking at buf+count.

    You can pass a string to write(). It doesn't care. It's basically just a memcpy into the kernel's pagecache (or into a pipe buffer or whatever for non-regular files). But like I said, you can't pass an arbitrary non-terminated buffer as the path arg to open().
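
    The contrast between the two calls can be sketched in one small C function for a POSIX system. The wrapper name write_bytes_to is hypothetical; open, write and close are the real calls discussed above.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: write len bytes of buf to the file at path.
   path must be 0-terminated, because open() has no length parameter.
   buf needs no terminator and may contain 0 bytes, because write() is
   told the length. Returns the byte count reported by write(), which
   can be less than len on a partial write, or -1 on error. */
ssize_t write_bytes_to(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    ssize_t n = write(fd, buf, len);
    close(fd);
    return n;
}
```

    Passing an unterminated buffer as buf is fine; passing one as path is not, since the kernel would keep reading past its end looking for the 0 byte.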

    Or are they not equivalent?

    Implicit-length and explicit-length are the two major ways of keeping track of string data/constants in memory and passing them around. They solve the same problem, but in opposite ways.

    Long implicit-length strings are a bad choice if you sometimes need to find their length before walking through them, because looping through a string is a lot slower than just reading an integer. Finding the length of an implicit-length string takes O(n) time, while for an explicit-length string it's O(1): the length is already stored. That's the length in bytes, though; the length in Unicode characters might still be unknown if the text is in a variable-length encoding like UTF-8 or UTF-16.
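
    To make the cost difference concrete, here is a minimal C sketch of an explicit-length string. The type str_view and both function names are hypothetical, and the code point counter assumes the data is valid UTF-8.

```c
#include <stddef.h>

/* Hypothetical explicit-length string type: pointer plus byte count. */
struct str_view {
    const char *data;
    size_t      len;    /* length in bytes; no terminator needed */
};

/* O(1): the byte length is simply stored. */
size_t sv_byte_len(struct str_view s)
{
    return s.len;
}

/* O(n): counting UTF-8 code points still needs a pass over the data.
   Assumes valid UTF-8, where continuation bytes look like 10xxxxxx,
   so every byte that is not a continuation byte starts a code point. */
size_t sv_utf8_codepoints(struct str_view s)
{
    size_t n = 0;
    for (size_t i = 0; i < s.len; i++)
        if (((unsigned char)s.data[i] & 0xC0) != 0x80)
            n++;
    return n;
}
```

    So storing the length removes the scan for the byte count, but not for anything that depends on how the bytes decode.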

    Compilers also have a much easier time auto-vectorizing with SIMD for loops whose trip count is known before the first iteration runs. So traditional C-string algorithms usually need to be manually vectorized if you want them to run fast on modern CPUs.
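
    The two loop shapes look like this in C. Both function names are hypothetical, and whether a particular compiler actually vectorizes either loop depends on its version and flags, so treat this as an illustration of the loop structure rather than a performance claim.

```c
#include <stddef.h>

/* Trip count (len) is known before the loop starts, which is the shape
   auto-vectorizers handle most easily. */
size_t count_byte_len(const char *buf, size_t len, char c)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++)
        n += (buf[i] == c);
    return n;
}

/* The exit condition depends on the byte just loaded, so the trip count
   is unknown up front. c is assumed to be nonzero. */
size_t count_byte_cstr(const char *s, char c)
{
    size_t n = 0;
    for (; *s != '\0'; s++)
        n += (*s == c);
    return n;
}
```

    The second version also stops at the first 0 byte, so it gives a different answer from the first on data with embedded zeroes.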