
Why do LF and CRLF behave differently with /^\s*$/gm regex?


I've been seeing this issue on Windows. When I try to clear the whitespace on each blank line on Unix:

const input =
`===

HELLO

WOLRD

===`
console.log(input.replace(/^\s+$/gm, ''))

This produces what I expect:

===

HELLO

WOLRD

===

i.e. if there were spaces on blank lines, they'd get removed. On the other hand, on Windows, the regex clears the WHOLE line. To illustrate:

const input =
`===

HELLO

WOLRD

===`.replace(/\r?\n/g, '\r\n')
console.log(input.replace(/^\s+$/gm, ''))

(Template literals always produce only \n in JS, so I had to replace it with \r\n to emulate Windows. The ? after \r is just to be sure, for those who don't believe.) The result:

===
HELLO
WOLRD
===

The whole line is gone! But my regex has ^ and $ with the m flag set, so it's kind of /^-to-$/m. What's the difference between \n and \r\n then that makes it produce different results?

When I do some logging:

console.log(input.replace(/^\s*$/gm, (m) => {
  console.log('matched')
  return ''
}))

With \r\n I'm seeing

matched
matched
matched
matched
matched
matched
===
HELLO
WOLRD
===

and with \n only

matched
matched
matched
===

HELLO

WOLRD

===

Solution

  • TL;DR: a pattern that matches whitespace, which includes line breaks, will also match the individual characters of a \r\n sequence, if you let it.

    First of all, let's actually examine what characters are there and aren't there when you do a replacement. Starting with a string that only uses line feeds:

    const inputLF =
    `===
    
    HELLO
    
    WOLRD
    
    ===`.replace(/\r?\n/g, "\n");
    
    console.log('------------ INPUT ')
    console.log(inputLF);
    console.log('------------')
    
    debugPrint(inputLF, 2);
    debugPrint(inputLF, 3);
    debugPrint(inputLF, 4);
    debugPrint(inputLF, 5);
    
    const replaceLF = inputLF.replace(/^\s+$/gm, '');
    
    console.log('------------ REPLACEMENT')
    console.log(replaceLF);
    console.log('------------')
    
    debugPrint(replaceLF, 2);
    debugPrint(replaceLF, 3);
    debugPrint(replaceLF, 4);
    debugPrint(replaceLF, 5);
    
    console.log(`charcode ${replaceLF.charCodeAt(2)} : ${replaceLF.charAt(2)}`);
    console.log(`charcode ${replaceLF.charCodeAt(3)} : ${replaceLF.charAt(3)}`);
    console.log(`charcode ${replaceLF.charCodeAt(4)} : ${replaceLF.charAt(4)}`);
    console.log(`charcode ${replaceLF.charCodeAt(5)} : ${replaceLF.charAt(5)}`);
    
    console.log('------------')
    console.log('inputLF === replaceLF :', inputLF === replaceLF)
    
    function debugPrint(str, charIndex) {
      console.log(`index: ${charIndex}
       charcode: ${str.charCodeAt(charIndex)}
       character: ${str.charAt(charIndex)}`
     );
    }

    Each line ends with char code 10, which is the Line Feed (LF) character, represented in a string literal as \n. Before and after the replacement, the two strings are the same: they not only look the same but are actually equal to each other, so the replacement did nothing.

    Now let's examine the other case:

    const inputCRLF =
    `===
    
    HELLO
    
    WOLRD
    
    ===`.replace(/\r?\n/g, "\r\n")
    console.log('------------ INPUT ')
    console.log(inputCRLF);
    console.log('------------')
    
    debugPrint(inputCRLF, 2);
    debugPrint(inputCRLF, 3);
    debugPrint(inputCRLF, 4);
    debugPrint(inputCRLF, 5);
    debugPrint(inputCRLF, 6);
    debugPrint(inputCRLF, 7);
    
    const replaceCRLF = inputCRLF.replace(/^\s+$/gm, '');
    
    console.log('------------ REPLACEMENT')
    console.log(replaceCRLF);
    console.log('------------')
    
    debugPrint(replaceCRLF, 2);
    debugPrint(replaceCRLF, 3);
    debugPrint(replaceCRLF, 4);
    debugPrint(replaceCRLF, 5);
    
    function debugPrint(str, charIndex) {
      console.log(`index: ${charIndex}
       charcode: ${str.charCodeAt(charIndex)}
       character: ${str.charAt(charIndex)}`
     );
    }

    This time each line ends with char code 13, which is the Carriage Return (CR) character, represented in a string literal as \r, and then the LF follows. After the replacement, instead of a sequence of =\r\n\r\nH we now have just =\r\nH. Let's look at why.
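
    A more compact way to see the same thing, as a rough sketch: JSON.stringify escapes control characters, so the CR and LF become visible as \r and \n. The helper name show and the variable names below are made up for this illustration and are not part of the snippets above.

```javascript
// Hypothetical helper: print a string with its control characters escaped.
const show = (label, str) => console.log(label, JSON.stringify(str));

const lf = "===\n\nHELLO\n\nWOLRD\n\n===";
const crlf = lf.replace(/\n/g, "\r\n"); // emulate Windows line endings

const lfAfter = lf.replace(/^\s+$/gm, "");
const crlfAfter = crlf.replace(/^\s+$/gm, "");

show("LF before:  ", lf);
show("LF after:   ", lfAfter);   // unchanged
show("CRLF before:", crlf);
show("CRLF after: ", crlfAfter); // the blank lines are gone
```

    With LF the two strings print identically, while with CRLF every \r\n\r\n collapses to a single \r\n.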

    Here is what MDN says about the meta character ^:

    Matches the beginning of input. If the multiline flag is set to true, also matches immediately after a line break character.

    And here is what MDN says about the meta character $

    Matches the end of input. If the multiline flag is set to true, also matches immediately before a line break character.

    So they match after and before a line break character. By that, MDN means either the LF or the CR, each on its own. This can be seen if we test a string that contains different line breaks:

    const stringLF = "hello\nworld";
    const stringCRLF = "hello\r\nworld";
    
    const regexStart = /^\s/m;
    const regexEnd = /\s$/m;
    
    console.log(regexStart.exec(stringLF));
    console.log(regexStart.exec(stringCRLF));
    
    console.log(regexEnd.exec(stringLF));
    console.log(regexEnd.exec(stringCRLF));

    If we try to match whitespace next to a line break, nothing matches when the line break is a lone LF, but with CRLF ^\s matches the LF and \s$ matches the CR. So, in that case the two patterns match here:

    "hello\r\nworld"
            ^^ what `^\s` matches
    
    "hello\r\nworld"
          ^^ what `\s$` matches
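
    The positions marked above can be checked with a small sketch. It inserts a visible marker at every place where ^ or $ matches, and then prints the index and the escaped text of each whitespace match. The sample string and variable names are illustrative only.

```javascript
const sample = "a\r\nb";

// ^ and $ are zero-width, so replacing them inserts a marker
// at each position where they match.
const starts = sample.replace(/^/gm, "[");
const ends = sample.replace(/$/gm, "]");

console.log(JSON.stringify(starts)); // "[a\r[\n[b"
console.log(JSON.stringify(ends));   // "a]\r]\nb]"

// Which character does the whitespace token pick up next to each anchor?
const startMatch = /^\s/m.exec("hello\r\nworld");
const endMatch = /\s$/m.exec("hello\r\nworld");

console.log(startMatch.index, JSON.stringify(startMatch[0])); // 6 "\n"
console.log(endMatch.index, JSON.stringify(endMatch[0]));     // 5 "\r"
```

    Note the marker between \r and \n in both outputs: as far as the anchors are concerned, there is an empty line hiding inside every CRLF.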
    

    So both ^ and $ treat either character of the CRLF sequence as a line break. This makes a difference when you do a search and replace. Since your regex specifies ^\s+$, it does match when you have a line that is entirely \r\n. But for a reason that is not obvious:

    const re = /^\s+$/m;
    
    const stringLF = "hello\n\nworld";
    const stringCRLF = "hello\r\n\r\nworld";
    
    
    console.log(re.exec(stringLF));
    console.log(re.exec(stringCRLF));

    So, the regex doesn't match a \r\n but rather \n\r (two whitespace characters) between two other line break characters. That's because + is greedy and will consume as much of the character sequence as it can get away with. Here is what the regex engine will try, somewhat simplified for brevity:

    input = "hello\r\n\r\nworld"
    regex = /^\s+$/m
    
    Step 1
    hello\r[]\n\r\nworld
        `^` matches right after the `\r`, symbol satisfied -> continue with next symbol in regex
    
    Step 2
    hello\r[\n]\r\nworld
        matches `^\s+` -> continue matching to satisfy `+` quantifier
    
    Step 3
    hello\r[\n\r]\nworld
        matches `^\s+` -> continue matching to satisfy `+` quantifier
    
    Step 4
    hello\r[\n\r\n]world
        matches `^\s+` -> continue matching to satisfy `+` quantifier
    
    Step 5
    hello\r[\n\r\nw]orld
        `w` does not match `\s` -> backtrack
    
    Step 6
    hello\r[\n\r\n]world
        matches `^\s+`, quantifier satisfied -> continue to next symbol in regex
    
    Step 7
    hello\r[\n\r\n]world
        `$` does not match before `w` -> backtrack, give up one character
    
    Step 8
    hello\r[\n\r]\nworld
        `$` matches before the `\n`, so `^\s+$` is satisfied -> finish
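
    To check that the length of the match really comes from the quantifier, here is a small sketch that compares the greedy + with its lazy counterpart +? on the same input. The lazy variant is only there for contrast and is not something the question uses.

```javascript
const text = "hello\r\n\r\nworld";

// Greedy: grab as much whitespace as possible, then give characters
// back until $ is satisfied.
const greedy = /^\s+$/m.exec(text);

// Lazy: take one whitespace character at a time and stop as soon
// as $ is satisfied.
const lazy = /^\s+?$/m.exec(text);

console.log(greedy.index, JSON.stringify(greedy[0])); // 6 "\n\r"
console.log(lazy.index, JSON.stringify(lazy[0]));     // 6 "\n"
```

    Both start right after the first \r, but the greedy version only settles for \n\r after failing to end the match in front of the w.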
    

    Lastly, there is something slightly hidden here: it matters that you're matching whitespace. This is because \s behaves differently from most other symbols in that it explicitly matches a line break character, whereas . will not:

    Matches any single character except line terminators

    So, if you specify \s$ this will match the CR in \r\n because the regex engine is forced to look for a match for both \s and $, therefore it finds the \r before the \n. However, this will not happen for many other patterns, since $ will usually be satisfied when it's before CR (or at the end of the string).

    The same goes for ^\s: it will explicitly look for a whitespace character after a line break, which is satisfied by the LF in CRLF. However, if you're not seeking that, then it will happily match after the LF:

    const stringLF = "hello\nworld";
    const stringCRLF = "hello\r\nworld";
    
    const regexStartAll = /^./mg;
    const regexEndAll = /.$/gm;
    
    console.log(stringLF.match(regexStartAll));
    console.log(stringCRLF.match(regexStartAll));
    
    console.log(stringLF.match(regexEndAll));
    console.log(stringCRLF.match(regexEndAll));

    So, all of this means that ^\s+$ has some unintuitive behaviour, yet it is perfectly coherent once you understand that the regex engine matches exactly what you tell it to.
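
    If the goal is only to strip spaces and tabs from otherwise blank lines, one possible way around this (a sketch, not the only option) is to tell the engine not to treat CR and LF as removable whitespace. The character class [^\S\r\n] reads as "whitespace, but not CR or LF". The variable names and the sample input are assumptions made for the example.

```javascript
// Assumption: "whitespace on a blank line" means any whitespace
// character except CR and LF themselves.
const blankLinePadding = /^[^\S\r\n]+$/gm;

const lfInput = "===\n  \nHELLO\n\t\nWORLD\n\n===";
const crlfInput = lfInput.replace(/\n/g, "\r\n"); // emulate Windows

const cleanedLF = lfInput.replace(blankLinePadding, "");
const cleanedCRLF = crlfInput.replace(blankLinePadding, "");

console.log(JSON.stringify(cleanedLF));
console.log(JSON.stringify(cleanedCRLF));
```

    With this pattern the blank lines keep their line breaks under both conventions, and only the padding on them is removed.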