hierarchical regex expression


Is it possible/practical to build a single regular expression that matches hierarchical data?

For example:

<h1>Action</h1>
  <h2>Title1</h2><div>data1</div>
  <h2>Title2</h2><div>data2</div>
<h1>Adventure</h1>
  <h2>Title3</h2><div>data3</div>

I would like to end up with these matches:

"Action", "Title1", "data1"
"Action", "Title2", "data2"
"Adventure", "Title3", "data3"

As I see it, this requires the pattern to know that a hierarchical structure is in play. If I code the pattern to capture the H1, it only matches the first entry under that heading; if I don't code for the H1, I can't capture it. Are there any special tricks I can employ to solve this?

This is a .NET project.


Solution

  • The solution is to not use regular expressions. They're not powerful enough for hierarchical data like this.

    What you want is a parser. Since it looks like you're trying to match HTML, there are plenty to choose from.
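
To make the parser suggestion concrete, here is a minimal sketch of the idea. The question targets .NET, but this example uses Python's standard-library html.parser purely for illustration; a .NET HTML parser should allow the same approach. The class name HeadingRowParser and the helper extract_rows are hypothetical names invented for this example, and the code assumes the flat h1 / h2 / div layout shown in the question.

```python
from html.parser import HTMLParser


class HeadingRowParser(HTMLParser):
    """Collect (h1, h2, div) triples, carrying the current h1 forward.

    Hypothetical helper for illustration; assumes each <div> follows
    the <h2> it belongs to, as in the sample from the question.
    """

    def __init__(self):
        super().__init__()
        self.rows = []     # finished (category, title, data) tuples
        self._tag = None   # tag whose text we are currently inside
        self._h1 = None    # text of the most recent <h1>
        self._h2 = None    # text of the most recent <h2>

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "div"):
            self._tag = tag

    def handle_endtag(self, tag):
        if tag == self._tag:
            self._tag = None

    def handle_data(self, data):
        text = data.strip()
        if not text or self._tag is None:
            return  # whitespace between tags, or text we don't track
        if self._tag == "h1":
            self._h1 = text
        elif self._tag == "h2":
            self._h2 = text
        else:
            # A <div> completes one row under the current h1 and h2.
            self.rows.append((self._h1, self._h2, text))


def extract_rows(html):
    """Return a list of (h1, h2, div) tuples found in the given HTML."""
    parser = HeadingRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows


sample = """
<h1>Action</h1>
  <h2>Title1</h2><div>data1</div>
  <h2>Title2</h2><div>data2</div>
<h1>Adventure</h1>
  <h2>Title3</h2><div>data3</div>
"""

for row in extract_rows(sample):
    print(row)
```

The parser remembers the last h1 it saw, so every h2/div pair is reported with its heading. This is exactly the state that a single regex match cannot carry from one row to the next.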