Tags: c, string, character, string-literals, integer-arithmetic

How does "str" - "str" in C work? How are they stored?


Disclaimer: This question asks how "str literal" - "str literal" works

For how 'a' + 'b' or '9' - '0' = 9 ('character' + 'character') works :


Question:

To everyone who's more familiar with C, thanks for reading

(compiled with clang, standard=C11)

Example:

(trying to print __FILE__ without its ".c" extension)

printf("%s\n", __FILE__); returns filename.c

printf("%.*s\n", (int)(".c" - __FILE__), __FILE__); returns filename

1. How does C typecast string/string literals to int? Are whitespaces ignored?

  • What does the value of an (int)"string" represent?

Another example:

(int)("word" - "rd") = 6273
(int)("rd" - "word") = -6273
(int)("word" - "  rd") = -5
(int)("  rd" - "word") = -5

2. Why does (int)(".c" - __FILE__) even work?

3. Is the printf function above actually working?

4. Is there a string equivalent to 'a' + 1 = 'b' ?

Thanks in advance guys!



irrelevant guessing:

1 Why does (int)(".c" - __FILE__) even work?

guessing its

  some value of (first?) pointer to ".c" 
- some value of (first?) pointer to __FILE__ string literal 

2. What does the value of (int)"string" actually represent?

  • Why does (int)(".c" - __FILE__) even work?

idk but here's another example:

printf("%i", (int)".c");
printf("%i", (int)__FILE__);
printf("%i", (int)(".c" - __FILE__));
printf("%i", (int)(__FILE__ - ".c"));
printf("%.*i", (int)(".c" - __FILE__), (int)__FILE__);
printf("%.*i", (int)(".c" - __FILE__), (int)("c" - __FILE__));

output
---------------------
(int) ".c"= 4357629
(int) __FILE__= 4357620
(int) (".c" - __FILE__) : 9
(int) (__FILE__ - ".c"): -9
(int with precision specified) __FILE__ : 004357620
(int with precision specified) (".c" - __FILE__): 000000009 
$

3. Is printf actually working?

Assuming it does, probably:

printf("%.*s",(int)(".c" - __FILE__), __FILE__)
    width = (int)(".c" - __FILE__)   
    specifier/str = __FILE__

printf prints out __FILE__ as a string of width (".c" - __FILE__) (two characters less)


Solution

  • There are three things happening in your examples.

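    Firstly, C's pointer arithmetic rules allow two pointers to be subtracted, yielding the distance between them measured in elements of the pointed-to type (which is bytes for char). Strictly, the result is only defined when both pointers point into the same array object, or one past its end. So for example: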

    char test[2];
    char *t1 = &test[0];
    char *t2 = &test[1];
    ptrdiff_t d = t2 - t1;  // d == 1; ptrdiff_t is declared in <stddef.h>
    

    Here ptrdiff_t is the signed integer type, declared in <stddef.h>, that holds the result of subtracting two pointers. Casting it to int is potentially erroneous where ptrdiff_t is wider than int: a 32-bit int covers only differences of up to 2 GiB, although a difference that large is unlikely in practice.
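
    As a minimal sketch of the rule above, the two helpers below (hypothetical names, not from the question) subtract pointers into one array: once in elements, and once in bytes by going through char pointers. Both assume the pointers refer to the same array object.

```c
#include <stddef.h>

/* Distance between two pointers into the same int array, counted in
   elements. The caller must pass pointers into (or one past the end of)
   the same array object; anything else is outside what C defines. */
ptrdiff_t element_distance(const int *first, const int *last)
{
    return last - first;
}

/* The same distance counted in bytes, obtained by viewing the array
   through char pointers before subtracting. */
ptrdiff_t byte_distance(const int *first, const int *last)
{
    return (const char *)last - (const char *)first;
}
```

    For an array int a[4], element_distance(&a[0], &a[3]) is 3 and byte_distance(&a[0], &a[3]) is 3 * sizeof(int). A ptrdiff_t can be printed with the %td conversion instead of being cast to int.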

    The second thing happening is that a string literal such as "word" is an array of char which, when used in an expression like this, is converted to a pointer to its first character.
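
    This is also the closest thing to a string version of 'a' + 1: adding an integer to the pointer moves along the characters of the same literal. The sketch below uses a hypothetical helper, tail_from, and assumes the offset is checked against the string length so the result stays inside the array.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns a pointer to the part of s that starts at
   offset n, or NULL when n lies beyond the terminating nul character.
   An offset equal to strlen(s) yields the empty string. */
const char *tail_from(const char *s, size_t n)
{
    return n <= strlen(s) ? s + n : NULL;
}
```

    With this, tail_from("word", 2) points at the characters "rd" inside the same literal, and sizeof "word" is 5 because the array includes the nul terminator.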

    And the third thing happening is that your toolchain has performed duplicate string elimination. The compiler or linker has searched your code for string literals that are identical, or that match the tail of a longer literal, and made them share a single copy in memory. This part of your observation is implementation dependent (the C standard leaves it unspecified whether such literals are distinct) and may not hold for all toolchains, or even the same toolchain with different compiler/linker settings.
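
    One way to see whether a given build merges literals is to compare the addresses of two identical ones. This is only a sketch with a hypothetical function name; either result is permitted, so nothing should depend on it.

```c
#include <stdbool.h>

/* Reports whether this particular build gave two identical string
   literals the same address. Both outcomes are allowed: the result may
   change with the compiler, the linker, or the optimisation settings. */
bool literals_share_storage(void)
{
    const char *a = "word";
    const char *b = "word";
    return a == b;  /* comparing pointers for equality is always defined */
}
```

    Unlike subtraction, the == comparison is defined even when the two pointers refer to different objects, which makes it a safe probe.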

    The built-in macro __FILE__ expands to a string literal containing the name of the source file in which it appears. In the example:

    (int)(".c" - __FILE__)
    

    __FILE__ is "filename.c", and the linker finds the duplicate ".c" at the end of it (it must be at the end because the nul terminator must match as well). So the difference between the two pointer values is 8 (the length of "filename"). So the statement:

    printf("%.*s\n", (int)(".c" - __FILE__), __FILE__);
    

    passes 8 as the precision of %s (that is what .* does), so it prints only the first 8 characters of the string "filename.c", which is "filename".
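
    The same output can be produced without depending on string merging. The sketch below is one possible approach, not the asker's method: a hypothetical helper, stem_length, finds the last '.' with strrchr and returns the number of characters before it. It assumes '/' is the path separator.

```c
#include <string.h>

/* Length of path with its final ".ext" removed. If there is no dot, or
   the last dot belongs to a directory name rather than the file name,
   the full length is returned. Assumes '/' as the path separator. */
int stem_length(const char *path)
{
    const char *dot = strrchr(path, '.');
    const char *slash = strrchr(path, '/');

    /* dot and slash both point into path, so comparing them is defined. */
    if (dot == NULL || (slash != NULL && dot < slash))
        return (int)strlen(path);

    return (int)(dot - path);
}
```

    It would be used as printf("%.*s\n", stem_length(__FILE__), __FILE__); both pointers in the subtraction point into the same string, so the result does not depend on how the toolchain lays out literals.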

    Something more complicated is happening with:

    (int)("word" - "rd") = 6273
    (int)("rd" - "word") = -6273
    (int)("word" - "  rd") = -5
    (int)("  rd" - "word") = -5 
    

    In the first and second cases you might, from the __FILE__ example, expect -2 and 2 respectively. That could well occur, except that in this case the linker may have matched the "rd" with the end of the "  rd" string rather than with the end of "word". The linker behaviour is implementation specific and not predictable from the source alone. The results are likely to vary if, for example, you removed the third and fourth expressions so that the "  rd" literal no longer existed. Strings from entirely different link modules may be involved.

    The point is that you cannot rely on this behaviour. The string elimination is entirely implementation specific, and while pointer arithmetic and the conversion of a string literal to a pointer are well defined in themselves, subtracting pointers that do not point into the same array is undefined behaviour in standard C. It is interesting as an examination of linker behaviour, but is not useful as a programming technique.