java regex italic

Regex for italic markdown


I've been trying for hours to write a regex that selects everything between underscores. Example:

_italic_

The only condition is that it must ignore \_ (a backslash followed by an underscore).

So this would be a match (all the text between the two _ delimiters):

_italic some text 123 \_*%&$ _

So far I have this regex:

(\_.*?\_)(?!\\\_)

But it does not ignore the \_.

Which regex would work?


Solution

  • You can use

    (?s)(?<!\\)(?:\\{2})*_((?:[^\\_]|\\.)+)_
    

    Details:

    • (?s) - an inline flag equal to Pattern.DOTALL, so that . also matches line break characters
    • (?<!\\)(?:\\{2})* - a position that is not immediately preceded by a backslash, followed by zero or more pairs of backslashes (so the underscore that follows has an even number of backslashes before it and is not escaped)
    • _ - an underscore
    • ((?:[^\\_]|\\.)+) - Capturing group 1: one or more occurrences of any character other than \ and _, or of any escaped character (a \ followed by any one character)
    • _ - an underscore
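The role of the (?<!\\)(?:\\{2})* prefix is easiest to see by varying the number of backslashes in front of the opening underscore. The following is a minimal sketch, not part of the original answer: the class name EscapeParityDemo and the helper firstItalic are made up for illustration, and only the pattern itself comes from the answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapeParityDemo {
    // Same pattern as in the answer, written as a Java string literal.
    static final Pattern ITALIC =
            Pattern.compile("(?s)(?<!\\\\)(?:\\\\{2})*_((?:[^\\\\_]|\\\\.)+)_");

    // Hypothetical helper: returns the first group 1 capture, or null if nothing matches.
    static String firstItalic(String input) {
        Matcher m = ITALIC.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // No backslash: the underscore is a plain delimiter.
        System.out.println(firstItalic("_a_"));        // a
        // One backslash: the opening underscore is escaped, so there is no match.
        System.out.println(firstItalic("\\_a_"));      // null
        // Two backslashes: an escaped backslash followed by a real delimiter.
        System.out.println(firstItalic("\\\\_a_"));    // a
        // Three backslashes: an escaped backslash, then an escaped underscore.
        System.out.println(firstItalic("\\\\\\_a_"));  // null
    }
}
```

An even number of backslashes leaves the underscore usable as a delimiter, while an odd number escapes it.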

    See the Java demo:

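    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Place the statements below inside a method such as main.
    List<String> strs = Arrays.asList("xxx _italic some text 123 \\_*%&$ _ xxx",
                                      "\\_test_test_");
    // In a Java string literal every regex backslash is doubled.
    String regex = "(?s)(?<!\\\\)(?:\\\\{2})*_((?:[^\\\\_]|\\\\.)+)_";
    Pattern p = Pattern.compile(regex);
    for (String str : strs) {
        Matcher m = p.matcher(str);
        List<String> result = new ArrayList<>();
        while (m.find()) {
            result.add(m.group(1)); // group 1 holds the text between the underscores
        }
        System.out.println(str + " => " + String.join(", ", result));
    }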
    

    Output:

    xxx _italic some text 123 \_*%&$ _ xxx => italic some text 123 \_*%&$ 
    \_test_test_ => test
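
For comparison, here is a sketch of why the pattern from the question stops too early. The lazy .*? ends at the first underscore it reaches, including an escaped one, and the lookahead (?!\\\_) only inspects what comes after the closing underscore, not what comes before it. The class name LazyMatchDemo and the helper first are assumptions made for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyMatchDemo {
    // The attempt from the question, copied as written there.
    static final Pattern ATTEMPT = Pattern.compile("(\\_.*?\\_)(?!\\\\\\_)");
    // The pattern from the answer.
    static final Pattern SOLUTION =
            Pattern.compile("(?s)(?<!\\\\)(?:\\\\{2})*_((?:[^\\\\_]|\\\\.)+)_");

    // Hypothetical helper: group 1 of the first match, or null if there is none.
    static String first(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "xxx _italic some text 123 \\_*%&$ _ xxx";
        // Stops at the escaped underscore: _italic some text 123 \_
        System.out.println("[" + first(ATTEMPT, input) + "]");
        // Skips the escaped underscore: italic some text 123 \_*%&$ (with a trailing space)
        System.out.println("[" + first(SOLUTION, input) + "]");
    }
}
```

The brackets in the printed output only make the trailing space of the second result visible.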