regex parsing dom string-parsing structured-data

DOM parsing, structured document traversal under the hood

As a developer, and I'm certain I'm far from alone here, I'm always curious to understand what's "under the hood". DOM parsers are one of the list-toppers of this curiosity for me. We all know the famous post. I have even hacked together a bit of an "O RLY?", out of both temporary necessity and curiosity.

However my need to meet the man-behind-the-curtain remains unmet. How do DOM parsers, or any structured document parsers for that matter, parse documents? As far as my intermediate web application developer understanding can muster, it's a combination of recursive string parsing and state-keeping logic, not unlike my own hackish attempt.

Magicians should never reveal their secrets, but seriously, where is he hiding the rabbit?

Solution

There's a well-developed theory of parsing, and untold numbers of tools to support it. In general, you look at each character, one at a time, and decide when the characters you've made so far constitute a token. Then you look at the series of tokens, and decide when the sequence of tokens constitute a higher-level grammatical construct -- in this case, an HTML element. As you recognize constructs, you build a tree of nodes to represent them -- in this case, the DOM tree.

So are you familiar with context-free grammars, and compiler-compilers like yacc, bison, and their more modern counterparts? If you understand those, a DOM parser shouldn't be a mystery.