I am writing a small compiler and I am having trouble handling string constants and string objects.
Take the following code for example:
s : String = "Hello world"
Since the string literal appears in the program, the compiler recognizes it, generates a string constant, and places it in the .data segment.
.data
string_const1:
.quad 2
.quad 6
.quad String_dispatch_table
.quad int_const1
.asciz "Hello world"
.align 8
The actual string can then be accessed with:
leaq string_const1(%rip), %rax
addq $32, %rax
However, if we ask the user to input a string, the string object has to be created dynamically. Here is the string object template, which is also placed in the .data segment.
String_protoObj:
.quad 2 # tag
.quad 5 # size
.quad String_dispatch_table
.quad 0 # string length
.quad 0 # string content
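# Allocate a new string object; String_protoObj is 5 words (40 bytes).
movq $40, %rdi
callq malloc # %rax = address of the new object
# (copying the header words from String_protoObj into the new object is omitted)
movq user_input(%rip), %rdi # user_input: placeholder for where the input string's address is stored
movq %rdi, 32(%rax) # store the pointer to the new string in the string content slot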
Later, to access the actual string, I have to use:
32(%rax) # read from memory
So a string constant and a dynamically allocated string object are accessed differently: the constant stores its characters inline at offset 32, while the object stores a pointer there. Every function that handles strings would need to treat the two cases differently.
I could obviously add another tag to the prototype object to indicate that it is an allocated object rather than a constant, but then every method that receives a string object or constant would have to check it, which does not sound elegant at all.
Can anyone suggest a good way to handle this situation?
Personally, I'd start by making the constant look like a string object, which means that the fifth word of the constant will contain a pointer to the sixth word, where the characters begin. That costs one extra word per constant, a price you're evidently already willing to pay with string objects.
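Here is a minimal C sketch of that first suggestion. It models the runtime's view of the objects, not the assembly your compiler would emit. The names StringObj, StringConst, string_chars and string_from_input are hypothetical, the dispatch pointer is left NULL, and the length is stored as a plain integer instead of a pointer to an Int constant, all to keep it short. The size field of the constant is assumed to grow to 7: five header words plus two words of characters.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical C view of the object layout from the question. */
typedef struct StringObj {
    int64_t tag;          /* 2 for String, as in the question */
    int64_t size;         /* object size in 8-byte words */
    const void *dispatch; /* stands in for String_dispatch_table */
    int64_t length;       /* simplified: plain length, not an Int object */
    const char *chars;    /* fifth word: ALWAYS a pointer to the characters */
} StringObj;

/* A constant is a StringObj followed directly by its characters. */
typedef struct StringConst {
    StringObj header;
    char data[16];        /* sixth word onward: the characters themselves */
} StringConst;

/* The fifth word points at the sixth word of the same object. */
StringConst string_const1 = {
    { 2, 7, NULL, 11, string_const1.data },
    "Hello world"
};

/* Build a heap object whose fifth word points at a separate copy of the input. */
StringObj *string_from_input(const char *input) {
    size_t len = strlen(input);
    StringObj *obj = malloc(sizeof *obj);
    char *copy = malloc(len + 1);
    if (obj == NULL || copy == NULL) {
        free(obj);
        free(copy);
        return NULL;
    }
    memcpy(copy, input, len + 1);
    obj->tag = 2;
    obj->size = 5;
    obj->dispatch = NULL;
    obj->length = (int64_t)len;
    obj->chars = copy;
    return obj;
}

/* One access path for both kinds: load the pointer stored at offset 32. */
const char *string_chars(const StringObj *s) {
    return s->chars;
}
```

With this layout, every function reaches the characters the same way, by loading the pointer at offset 32, whether the object lives in .data or came from malloc.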
A more space-efficient strategy is the one used by most modern C++ standard library implementations (the small-string optimization), in which there are two string layouts: one with an included character vector (short strings) and the other with a pointer. You can tell them apart from the length, so you don't need a different tag, but of course you could also use a different tag.
In practice, most strings are reasonably short so this optimization is believed to be useful. But it's more work, and a lot more tests to write, so you might want to save it for later.
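The two-layout strategy could look roughly like the following C sketch. Everything here is an assumption made for illustration: the names SsoString, sso_new, sso_chars, sso_is_inline and sso_free are hypothetical, and the threshold of one 8-byte word (strings shorter than 8 characters stay inline) is arbitrary. Real C++ libraries use larger inline buffers and different encodings of the discriminator.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

enum { INLINE_CAP = 8 }; /* hypothetical: the content slot is one 8-byte word */

typedef struct SsoString {
    int64_t tag;
    int64_t size;
    const void *dispatch;
    int64_t length;                    /* also selects the layout */
    union {
        char inline_chars[INLINE_CAP]; /* short strings: characters stored here */
        char *heap_chars;              /* long strings: pointer to the characters */
    } content;
} SsoString;

/* Short strings (length plus terminator fits in the slot) use the inline layout. */
int sso_is_inline(const SsoString *s) {
    return s->length < INLINE_CAP;
}

/* Single accessor that hides which layout is in use. */
const char *sso_chars(const SsoString *s) {
    return sso_is_inline(s) ? s->content.inline_chars : s->content.heap_chars;
}

SsoString *sso_new(const char *text) {
    size_t len = strlen(text);
    SsoString *s = calloc(1, sizeof *s);
    if (s == NULL) {
        return NULL;
    }
    s->tag = 2;
    s->size = 5;
    s->length = (int64_t)len;
    if (len < INLINE_CAP) {
        memcpy(s->content.inline_chars, text, len + 1);
    } else {
        s->content.heap_chars = malloc(len + 1);
        if (s->content.heap_chars == NULL) {
            free(s);
            return NULL;
        }
        memcpy(s->content.heap_chars, text, len + 1);
    }
    return s;
}

void sso_free(SsoString *s) {
    if (s != NULL && !sso_is_inline(s)) {
        free(s->content.heap_chars);
    }
    free(s);
}
```

The length check in sso_chars is the extra branch this layout costs on every access, which is part of why it is more work to implement and test than the pointer-only layout.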