Work-around a StackOverflowException


I'm using HtmlAgilityPack to parse roughly 200,000 HTML documents.

I cannot predict the contents of these documents, and one of them causes my application to fail with a StackOverflowException. That document contains this HTML:

<ol>
    <li><li><li><li><li><li>...
</ol>

There are roughly 10,000 <li> elements nested like that. Because of the way HtmlAgilityPack parses HTML, this nesting causes a StackOverflowException.

Unfortunately, since .NET 2.0 a StackOverflowException cannot be caught with a try/catch block; the runtime terminates the process instead.

I considered giving the threads a larger stack, but that feels like a hack. It would make my program use a lot more memory (it starts about 50 threads for processing HTML, and every one of them would get the larger stack), and the size would need manual adjustment if a similar document ever turned up again.

Are there any other workarounds I could employ?


Solution

  • Ideally, the long-term fix is to patch HtmlAgilityPack so that it uses a heap-allocated stack instead of the call stack, but that is too big an undertaking for me. I've temporarily lost my CodePlex account details; when I get them back I'll submit an issue report on the problem. Note that this issue is also a potential denial-of-service vulnerability for any site that uses HtmlAgilityPack to sanitize user-submitted HTML: a crafted, overly nested HTML document would cause the w3wp.exe process to die.

    In the meantime, the best way forward seems to be to set the maximum thread stack size explicitly. I was wrong earlier to say that a bigger stack size means every thread automatically consumes that much memory; it seems memory pages are allocated for a thread's stack as it grows, not all at once.

    I made a copy of the <ol><li> page and ran some experiments. My program failed when the maximum stack size was less than 2^21 bytes (2MB), but succeeded with a maximum of 2^22 bytes (4MB). In my book, 4MB passes as an "acceptable" hack... for now.
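
For illustration, here is a minimal sketch of how a larger maximum stack size can be requested in C#, using the `Thread(ThreadStart, int maxStackSize)` constructor. The `BigStack` class, its `Run` helper, and the `Recurse` method are hypothetical names invented for this example; they are not part of HtmlAgilityPack or of the program described above. `Recurse` is only a stand-in for a deeply recursive parse.

```csharp
using System;
using System.Threading;

public static class BigStack
{
    // 2^22 bytes = 4MB, the size that was enough for the test page above.
    // Other documents may need a different value (assumption: tune by experiment).
    public const int FourMegabytes = 1 << 22;

    // Runs `work` on a dedicated thread whose stack may grow up to
    // maxStackSize bytes, waits for it to finish, and returns its result.
    // Ordinary exceptions are captured and rethrown on the calling thread.
    // A StackOverflowException still cannot be caught and kills the process.
    public static T Run<T>(Func<T> work, int maxStackSize = FourMegabytes)
    {
        T result = default(T);
        Exception error = null;

        var thread = new Thread(() =>
        {
            try { result = work(); }
            catch (Exception ex) { error = ex; }
        }, maxStackSize);

        thread.Start();
        thread.Join();

        if (error != null)
            throw new InvalidOperationException("Worker thread failed.", error);
        return result;
    }

    // Hypothetical stand-in for a recursive parser: one call frame per level.
    public static int Recurse(int n) => n == 0 ? 0 : 1 + Recurse(n - 1);
}
```

In the scenario above, the delegate passed to `Run` would wrap the HtmlAgilityPack call, for example `doc.LoadHtml(html)`, in place of `Recurse`. Because the constructor is called once per thread, the 50 worker threads could each be created this way without changing how the rest of the program is structured.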