Tags: c#, .net, html, xml, sgml

Recommendation for parsing HTML and SGML files


I have a project that will accept several input formats: HTML, SGML, XML and TXT.

I have no problem parsing the XML and TXT files. Can you please suggest some tools that I can use for parsing HTML or SGML files?


Solution

  • For an HTML parser, use the Html Agility Pack - it is an open source HTML parser for .NET.

    What is exactly the Html Agility Pack (HAP)?

    This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).

    You can use this to query HTML and extract whatever data you wish.
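
As a rough illustration of the querying described above, here is a minimal sketch that pulls every link out of a sloppy HTML fragment. It assumes the HtmlAgilityPack NuGet package is referenced; the class name `HapExample` and the method `ExtractLinks` are made up for this example and are not part of the library.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class HapExample // hypothetical helper, not part of HAP
{
    // Returns the href value of every <a> element that has one.
    public static List<string> ExtractLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerant of unclosed or badly nested tags

        var links = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            return links;
        }

        foreach (HtmlNode anchor in anchors)
        {
            links.Add(anchor.GetAttributeValue("href", string.Empty));
        }
        return links;
    }
}
```

Calling `HapExample.ExtractLinks(File.ReadAllText("page.html"))` on a saved page would give you the link targets in document order. The same pattern works for any other XPath expression, for example `//table//td`.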

  • For an SGML parser, check out SgmlReader, which converts any HTML to valid XML:

    http://developer.mindtouch.com/Community/SgmlReader
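
Since you already handle XML, the usual approach with SgmlReader is to let it turn the HTML into XML and then reuse your existing XML code. The following is a minimal sketch, assuming the SgmlReader package (namespace `Sgml`) is referenced; the class name `SgmlExample` and the method `HtmlToXml` are hypothetical names chosen for this example.

```csharp
using System.IO;
using System.Xml;
using Sgml; // assumption: the SgmlReader NuGet package is installed

public static class SgmlExample // hypothetical helper, not part of SgmlReader
{
    // Reads loosely structured HTML and returns it as a well-formed XmlDocument.
    public static XmlDocument HtmlToXml(string html)
    {
        using (var sgmlReader = new SgmlReader())
        {
            sgmlReader.DocType = "HTML";                  // use the built-in HTML DTD
            sgmlReader.CaseFolding = CaseFolding.ToLower; // normalize tag names to lower case
            sgmlReader.InputStream = new StringReader(html);

            // SgmlReader derives from XmlReader, so any XML API can consume it.
            var doc = new XmlDocument();
            doc.XmlResolver = null; // do not fetch external resources
            doc.Load(sgmlReader);
            return doc;
        }
    }
}
```

After this, `HtmlToXml(html).SelectNodes("//a")` and the rest of `System.Xml` behave as they do for your XML inputs. Note that this sketch only covers the HTML case; an SGML file that is not HTML needs its own DTD supplied to the reader.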

    Reference: SGML parser .NET recommendations