c++ c escaping portability string-literals

Do useless backslashes produce well-defined string constants?


Both C and C++ support a seemingly equivalent set of escape sequences, such as \b, \t, \n, \" and others, that start with the backslash character (\). How is a backslash handled if an ordinary character follows it? As far as I remember from several compilers, the escape character \ is silently skipped. On cppreference.com, I read the corresponding articles for both languages.

I found only this note (in the C article) about orphan backslashes:

ISO C requires a diagnostic if the backslash is followed by any character not listed here: [...]

above the reference table. I also had a look at some online compilers:

C demo

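#include <stdio.h>
#include <string.h>  /* strcmp */

int main(void) {
    /* Each line prints 1 if the two literals compare equal, 0 otherwise. */
    printf("%d", !strcmp("\\ x", "\\ x"));
    /* The second literal below contains the orphan backslash in question. */
    printf("%d", !strcmp("\\ x", "\\\ x"));
    printf("%d", !strcmp("\\ x", "\\\\ x"));
    return 0;
}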

C++ demo

#include <iostream>
#include <string>
using namespace std;

int main() {
    cout << (string("\\ x") == "\\ x");
    cout << (string("\\ x") == "\\\ x");
    cout << (string("\\ x") == "\\\\ x");
    return 0;
}

Both demos treat "\\ x" and "\\\ x" as equivalent; the only (kind of) warning comes from the syntax highlighting. In other words, "\\\ x" has been transformed into "\\ x".

Can I assume this to be defined behavior?

Clarification (edit)

  • I'm not asking about obviously invalid string literals like "\".
  • I'm aware that an orphan backslash is somewhat problematic.
  • I want to know if the result, the constant built by the compiler, is defined.

Edit #2: Focus even more on the constant being generated (and on portability).


Solution

  • The answer is no. In C the program is invalid; in C++ such an escape sequence is only conditionally-supported, with implementation-defined semantics.

    C Standard

    says it is syntactically wrong (emphasis mine): it does not produce a valid token, so the program is invalid:

    5.2.1 Character sets

    2/ In a character constant or string literal, members of the execution character set shall be represented by corresponding members of the source character set or by escape sequences consisting of the backslash \ followed by one or more characters.

    6.4.4.4 Character constants

    3/ The single-quote ', the double-quote ", the question-mark ?, the backslash \, and arbitrary integer values are representable according to the following table of escape sequences:

    • single quote ' \'
    • double quote " \"
    • question mark ? \?
    • backslash \ \\
    • octal character \octal digits
    • hexadecimal character \xhexadecimal digits

    8/ In addition, characters not in the basic character set are representable by universal character names and certain nongraphic characters are representable by escape sequences consisting of the backslash \ followed by a lowercase letter: \a, \b, \f, \n, \r, \t, and \v. Note: If any other character follows a backslash, the result is not a token and a diagnostic is required.

    C++ standard

    says differently (emphasis mine):

    5.13.3 Character literals

    7/ Certain non-graphic characters, the single quote ', the double quote ", the question mark ?, and the backslash \, can be represented according to Table 8. The double quote " and the question mark ?, can be represented as themselves or by the escape sequences \" and \? respectively, but the single quote ' and the backslash \ shall be represented by the escape sequences \' and \\ respectively. Escape sequences in which the character following the backslash is not listed in Table 8 are conditionally-supported, with implementation-defined semantics. An escape sequence specifies a single character.

    Thus, for C++, you need to consult your compiler's manual for the semantics, but the program is syntactically valid.