
Compiler: String constant vs dynamically-allocated String object


I am writing a small compiler and am having trouble handling string constants and string objects.

Take the following code for example:

s : String = "Hello world"

Since the string appears literally in the program, the compiler recognizes it, generates a string constant, and places it in the .data segment.

.data

string_const1:
    .quad    2
    .quad    6
    .quad    String_dispatch_table
    .quad    int_const1
    .asciz    "Hello world"
    .align    8

The actual string can then be accessed with:

leaq string_const1(%rip), %rax
addq $32, %rax

However, if we ask users to input a string, then the string object needs to be generated dynamically. Here is the string object template that is also placed in the .data segment.

String_protoObj:
    .quad    2                        # tag              
    .quad    5                        # size
    .quad    String_dispatch_table    
    .quad    0                        # string length
    .quad    0                        # string content

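# allocate space for a copy of String_protoObj (5 words = 40 bytes);
# malloc returns the address of the new object in %rax
# (copying the template fields into the new object is omitted here)

movq $40, %rdi
callq malloc
movq user_input(%rip), %rdi   # user_input: placeholder for wherever the address of the input string is kept
movq %rdi, 32(%rax)   # store the address of the input string in the string content slot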

Then later, to access the actual string, I have to use

32(%rax)  # read from memory

So accessing the string differs between a string constant (the characters are stored inline at offset 32) and a dynamically allocated string object (offset 32 holds a pointer to the characters), which requires different handling in every function.

I could obviously add another tag to the prototype object to indicate that this is an allocated object rather than a constant, but then every method that receives a string object or constant would have to check it, which does not sound elegant at all.

Can anyone suggest a good way to handle this situation?


Solution

  • Personally, I'd start by making the constant look like a string object, which means that the fifth word will contain a pointer to the sixth word, where the characters begin. That extra indirection is a price you're evidently already willing to pay for string objects.

    A more space-efficient strategy is the one used by most modern C++ libraries, in which there are two string layouts: one with an included character vector (short strings) and the other with a pointer. You can tell them apart by the length, so you don't need a different tag, although you could use one.

    In practice, most strings are reasonably short so this optimization is believed to be useful. But it's more work, and a lot more tests to write, so you might want to save it for later.
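
The first suggestion could look roughly like the sketch below, which reuses the constant from the question. It assumes the size field counts the inline characters, so it grows from 6 to 7 words once the pointer word is added. The label string_const1_chars and the routine String_chars are hypothetical names, and the stubs for String_dispatch_table and int_const1 are placeholders that only exist so the fragment assembles on its own.

```asm
    .data
    .align   8

# Placeholder stubs (assumptions): a real compiler would emit the actual
# dispatch table and the Int constant object here.
String_dispatch_table:
    .quad    0
int_const1:
    .quad    11                       # stands in for the Int object holding the length

string_const1:
    .quad    2                        # tag
    .quad    7                        # size: 5 header words + 2 words of characters (assumed convention)
    .quad    String_dispatch_table
    .quad    int_const1               # string length
    .quad    string_const1_chars      # string content: pointer to the characters (fifth word)
string_const1_chars:                  # sixth word: the characters themselves
    .asciz   "Hello world"
    .align   8

    .text
    .globl   String_chars
# String_chars (hypothetical helper)
#   in:  %rdi = address of a String object, constant or heap-allocated
#   out: %rax = address of its characters
String_chars:
    movq     32(%rdi), %rax           # same load for both kinds of string
    ret
```

With this layout, a constant and a copy of String_protoObj are read the same way: load the pointer at offset 32. For example, leaq string_const1(%rip), %rdi followed by callq String_chars leaves the address of "Hello world" in %rax. The cost is one extra word per constant.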