Search code examples
javaregexstack-overflow

Why does java's Regex throws stackoverflow exception sometimes?


I am trying to make a regular expression for HTML tags. The regex I've created so far is <(/?)(\w+?)(\s(.*?))*?((/>)|>), when I tested it online it worked perfectly; but when I tested it using Java regex it sometimes throws StackOverFlowError and sometimes it doesn't.

I'am using this code for testing :

public static void parseHtml(String urlString){
    new Thread(new Runnable() {
        @Override
        public void run() {
            int count = 0;
            int count2 = 0;
            String htmlScript = downloadWebPage(urlString);
            Matcher matcher = Pattern.compile("<(/?)(\\w+?)(\\s(.*?))*?((/>)|>)",
                                              Pattern.DOTALL).matcher(htmlScript);
            while(matcher.find()) {
                System.out.println(matcher.group());
            }
        }
    }).start();
}

So, my question is : Why does Java's regex engine throws StackOverFlowError sometimes and sometimes it doesn't?

Note: I used the same test input (The same URL), and it threw the error, and tested it again later and it worked nicely.


Solution

  • I think Java notoriously doesn't like alternations in certain circumstances
    where there are potential backtrack problems.

    So, this part (\s(.*?))*? creates an undo burden on the backtrack
    mechanism.

     (                             # (3 start)
          \s
          ( .*? )                       # (4)
     )*?                           # (3 end)
    

    Where the end result is a nested optional quantifiers.
    It can be reduced to ([\S\s]*?) without having the nesting issues.

    Also, this part ((/>)|>) can be reduced to (/?>) eliminating the need
    for another stack frame via the alternation.

    On a whole, you don't really need the capture groups.


    If you just need to pars the tags, which is the initial level of html
    parsing, then using regex is ok.

    If you want to do more than parse individual tags, a DOM parser is needed.

    I find this regex will parse all individual html/xml tags.

    https://regex101.com/r/YXhCxe/1

    "<(?:(?:(?:(script|style|object|embed|applet|noframes|noscript|noembed)(?:\\s+(?>\"[\\S\\s]*?\"|'[\\S\\s]*?'|(?:(?!/>)[^>])?)+)?\\s*>)[\\S\\s]*?</\\1\\s*(?=>))|(?:/?[\\w:]+\\s*/?)|(?:[\\w:]+\\s+(?:\"[\\S\\s]*?\"|'[\\S\\s]*?'|[^>]?)+\\s*/?)|\\?[\\S\\s]*?\\?|(?:!(?:(?:DOCTYPE[\\S\\s]*?)|(?:\\[CDATA\\[[\\S\\s]*?\\]\\])|(?:--[\\S\\s]*?--)|(?:ATTLIST[\\S\\s]*?)|(?:ENTITY[\\S\\s]*?)|(?:ELEMENT[\\S\\s]*?))))>"

    Expanded

     <
     (?:
          (?:
               (?:
                    # Invisible content; end tag req'd
                    (                             # (1 start)
                         script
                      |  style
                      |  object
                      |  embed
                      |  applet
                      |  noframes
                      |  noscript
                      |  noembed 
                    )                             # (1 end)
                    (?:
                         \s+ 
                         (?>
                              " [\S\s]*? "
                           |  ' [\S\s]*? '
                           |  (?:
                                   (?! /> )
                                   [^>] 
                              )?
                         )+
                    )?
                    \s* >
               )
    
               [\S\s]*? </ \1 \s* 
               (?= > )
          )
    
       |  (?: /? [\w:]+ \s* /? )
       |  (?:
               [\w:]+ 
               \s+ 
               (?:
                    " [\S\s]*? " 
                 |  ' [\S\s]*? ' 
                 |  [^>]? 
               )+
               \s* /?
          )
       |  \? [\S\s]*? \?
       |  (?:
               !
               (?:
                    (?: DOCTYPE [\S\s]*? )
                 |  (?: \[CDATA\[ [\S\s]*? \]\] )
                 |  (?: -- [\S\s]*? -- )
                 |  (?: ATTLIST [\S\s]*? )
                 |  (?: ENTITY [\S\s]*? )
                 |  (?: ELEMENT [\S\s]*? )
               )
          )
     )
     >