c# html xml html-content-extraction

C# - Best Approach to Parsing Webpage?


I've saved an entire webpage's HTML to a string, and now I want to grab the "href" values from the links, preferably in a way that lets me save them to separate strings later. What's the best way to do this?

I've tried saving the string as an .xml doc and parsing it using an XPathDocument navigator, but (surprise surprise) it doesn't navigate a not-really-an-xml-document too well.

Are regular expressions the best way to achieve what I'm trying to accomplish?


Solution

  • Regular expressions are one way to do it, but they can be problematic.

    Most HTML pages can't be parsed with standard XML techniques because, as you've found out, most don't validate as well-formed XML.

    You could spend the time trying to integrate HTML Tidy or a similar tool, but it would be much faster to just build the regex you need.

    UPDATE

    At the time of this update I've received 15 upvotes and 9 downvotes. I think that maybe people aren't reading the question or the comments on this answer. All the OP wanted to do was grab the href values. That's it. From that perspective, a simple regex is just fine. If the author had wanted to parse other items, there is no way I would recommend regex; as I stated at the beginning, it's problematic at best.
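
For the narrow case of grabbing only the href values, the simple-regex approach could look something like the sketch below. This is not the answerer's own code: the class name `HrefExtractor`, the method `ExtractHrefs`, and the pattern itself are illustrative assumptions. The pattern also assumes that attribute values are quoted and that no `>` appears inside an attribute before the href, which real-world pages will not always honor.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; not part of any library.
public static class HrefExtractor
{
    // Looks for an opening <a ...> tag and captures the quoted value of its
    // href attribute (double or single quotes), ignoring case.
    // Assumption: unquoted attribute values are not handled.
    private static readonly Regex HrefPattern = new Regex(
        @"<a\b[^>]*?\shref\s*=\s*(?:""(?<url>[^""]*)""|'(?<url>[^']*)')",
        RegexOptions.IgnoreCase);

    // Returns every captured href value, in document order.
    public static List<string> ExtractHrefs(string html)
    {
        var hrefs = new List<string>();
        foreach (Match match in HrefPattern.Matches(html))
        {
            hrefs.Add(match.Groups["url"].Value);
        }
        return hrefs;
    }
}
```

Calling `HrefExtractor.ExtractHrefs(html)` on the saved page string returns the values as a list, so each one can be copied into its own string later. The values come back exactly as written in the markup, so entities such as `&amp;` are not decoded and relative URLs are not resolved.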