regex, escaping, language-agnostic, string-literals

Why do regexes and string literals use different escape sequences?


The handling of escape sequences varies across languages and between string literals and regular expressions. For example, in Python the \s escape sequence can be used in regular expressions but not in string literals, whereas in PHP the \f form feed escape sequence can be used in regular expressions but not in string literals.

In PHP, there is a dedicated page for PCRE escape sequences (http://php.net/manual/en/regexp.reference.escape.php) but it does not have an official list of escape sequences that are exclusive to string literals.

As a beginner in programming, I am concerned that I may not have a full understanding of the background and context of this topic. Are these concerns valid? Is this an issue that others are aware of?

Why do different programming languages handle escape sequences differently between regular expressions and string literals?


Solution

  • The escape sequences in string literals exist so that the programming language can parse the literal unambiguously. For example, in many languages a string literal is written as characters between quotes, like so:

    my_string = 'x string'
    

    But if your string contains a quote character, you need a way to tell the programming language that it should be treated as a literal character rather than as the end of the string:

    my_string = 'x's string'  # syntax error: the second quote ends the string early
    my_string = 'x\'s string' # the backslash tells the programming language that the inner quote is literal, not the end of the string
    

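    I think that most programming languages share a similar core set of escape sequences for string literals (for example \n, \t, \\ and escaped quotes), although the details differ from language to language.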

    Regexes are a different story. You can think of them as their own separate language that happens to be written inside a string literal. In a regex, some characters, like the period (.), have a special meaning and must be escaped to match their literal counterpart, whereas other characters gain a special meaning only when preceded by a backslash.

    For example:

    regex_string = 'A.C'   # match an A, followed by any character, followed by C
    regex_string = 'A\.C'  # match an A, followed by a period, followed by C
    regex_string = 'AsC'   # match an A, followed by s, followed by C
    regex_string = 'A\sC'  # match an A, followed by a whitespace character (space, tab, ...), followed by C
    

    Because regexes are their own mini-language, it would not make sense for every regex escape sequence to be available in ordinary string literals.
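
To make the two layers concrete, here is a minimal sketch in Python (chosen only because the question mentions it). The helper name full_match and the variable names are made up for this illustration. It assumes Python's standard re module and uses raw strings (the r'...' prefix) so that the backslashes reach the regex engine untouched by the string-literal layer.

```python
import re


def full_match(pattern: str, text: str) -> bool:
    """Return True if the whole of `text` matches the regex `pattern`."""
    return re.fullmatch(pattern, text) is not None


# Layer 1: string-literal escapes are resolved by the language itself.
literal_newline = "\n"      # a single character: a line feed

# A raw string keeps the backslash, so the text really is A, \, s, C.
raw_pattern = r"A\sC"

# Without a raw string, the backslash has to be doubled to survive layer 1.
doubled_pattern = "A\\sC"

# Layer 2: the regex engine now interprets \s and \. on its own terms.
space_matches = full_match(raw_pattern, "A C")    # \s accepts a space
tab_matches = full_match(raw_pattern, "A\tC")     # \s also accepts a tab
dot_is_literal = full_match(r"A\.C", "A.C")       # \. accepts only a period
dot_is_wildcard = full_match(r"A.C", "AxC")       # bare . accepts any character
```

The raw and doubled spellings produce the same four-character text, which suggests that \s is never a string-literal escape at all: it is just a backslash and an s that the regex engine later chooses to interpret.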