javascript, regex, regex-lookarounds

Regex for Markdown Emphasis


I'm trying to match the following markdown text for emphasis:

_this should match_
__this shouldn't__
_ neither should this _
_nor this _
this _should match_as well_
__       (double underscore, shouldn't match)

The issue that I'm facing with my own efforts as well as other solutions on SO is that they still end up matching the third line:

_ neither should this _

Is there a way to check for my particular use case? I'm targeting browser applications, and since Firefox and Safari don't support lookbehinds yet, is there a way to do this without lookbehinds?

Here's the regex pattern that I've come up with so far: /(_)((?!\1|\s).*)?\1/

Luckily, this passes almost all of my checks; however, the pattern still matches:

_nor this _
__       (double underscore, shouldn't match)    

So, is there a way to ensure that there is at least one character between the underscores, and that the underscores are not separated from the text by a space?

Link to regexr playground: regexr.com/5300j

Example:

const regex = /(_)((?!\1|\s).*)?\1/gm;
const str = `_this should match_
__this shouldn't__
_ neither should this _
_nor this _
this _should match_as well_
__
_ neither should this _`;
let m;

while ((m = regex.exec(str)) !== null) {
    // This is necessary to avoid infinite loops with zero-width matches
    if (m.index === regex.lastIndex) {
        regex.lastIndex++;
    }
    
    // The result can be accessed through the `m`-variable.
    m.forEach((match, groupIndex) => {
        console.log(`Found match, group ${groupIndex}: ${match}`);
    });
}


Solution

  • You may use either of the following patterns:

    \b_(?![_\s])(.*?[^_\s])_\b
    \b_(?![_\s])(.*?[^_\s])_(?!\S)
    


    Details

    • \b - no word char (letter, digit, _) allowed immediately before the match
    • _ - an underscore
    • (?![_\s]) - no _ or whitespace chars are allowed immediately after _
    • (.*?[^_\s]) - Group 1:
      • .*? - any 0 or more chars other than line break chars, as few as possible
      • [^_\s] - any 1 char other than _ and whitespace
    • _ - an underscore
    • \b - no word char allowed immediately after the _.
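
As a quick check, the first pattern can be run against the sample lines from the question. This is only a minimal sketch: the names `emphasis`, `lines`, and `results` are illustrative and not part of the original answer, and the `g` flag is added here so that `matchAll` can be used.

```javascript
// First pattern from the answer; the g flag is an addition required by matchAll.
const emphasis = /\b_(?![_\s])(.*?[^_\s])_\b/g;

// Sample lines taken from the question.
const lines = [
  "_this should match_",
  "__this shouldn't__",
  "_ neither should this _",
  "_nor this _",
  "this _should match_as well_",
  "__",
];

// For each line, collect the full text of every match (empty array if none).
const results = lines.map((line) =>
  [...line.matchAll(emphasis)].map((m) => m[0])
);

lines.forEach((line, i) => {
  console.log(JSON.stringify(line), "=>", JSON.stringify(results[i]));
});
```

Only the first and fifth lines should produce a match, and the fifth line is matched as a single span, `_should match_as well_`, because the underscore inside `match_as` is followed by a word char and so fails the trailing `\b`.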

    Note that (?!\S) fails the match if there is a non-whitespace char immediately to the right of the current location, so it acts as a right-hand whitespace boundary.
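
The two patterns differ only in what they allow after the closing underscore. The sketch below contrasts them on a string where the emphasis is followed by punctuation; the variable names and sample strings are made up for illustration.

```javascript
// Same opening part, different right-hand checks.
const wordBoundaryEnd = /\b_(?![_\s])(.*?[^_\s])_\b/;
const whitespaceEnd = /\b_(?![_\s])(.*?[^_\s])_(?!\S)/;

// Emphasis followed by a comma.
const withPunctuation = "see _this_, please";
// Emphasis followed by a space.
const withSpace = "a _b_ c";

// \b holds between "_" (a word char) and "," (a non-word char).
const boundaryOnPunctuation = wordBoundaryEnd.test(withPunctuation); // true
// (?!\S) rejects the match because "," is a non-whitespace char.
const whitespaceOnPunctuation = whitespaceEnd.test(withPunctuation); // false

// Both patterns accept a closing underscore followed by whitespace.
const boundaryOnSpace = wordBoundaryEnd.test(withSpace); // true
const whitespaceOnSpace = whitespaceEnd.test(withSpace); // true

console.log({
  boundaryOnPunctuation,
  whitespaceOnPunctuation,
  boundaryOnSpace,
  whitespaceOnSpace,
});
```

Which variant fits depends on whether emphasis directly followed by punctuation should count: the `\b` version accepts it, while the `(?!\S)` version requires whitespace or the end of the input after the closing underscore.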