c# html d text-extraction

How to extract text from reasonably sane HTML?


My question is sort of like this question but I have more constraints:

  • I know the documents are reasonably sane
  • they are very regular (they all came from the same source)
  • I want about 99% of the visible text
  • about 99% of what is visible at all is text (they are more or less RTF converted to HTML)
  • I don't care about formatting or even paragraph breaks.

Are there any tools set up to do this or am I better off just breaking out RegexBuddy and C#?

I'm open to command line or batch processing tools as well as C/C#/D libraries.


Solution

  • You need to use the HTML Agility Pack.

    You probably want to find an element using LINQ and the Descendants call, then get its InnerText.
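
    A minimal sketch of that approach, assuming the HtmlAgilityPack NuGet package is referenced. The class and method names (HtmlTextExtractor, ExtractText) are made up for illustration, and skipping script/style content is my own assumption about what counts as "visible" text:

```csharp
using System;
using System.Linq;
using System.Net;
using HtmlAgilityPack;

// Hypothetical helper, not part of HTML Agility Pack itself.
public static class HtmlTextExtractor
{
    public static string ExtractText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var texts = doc.DocumentNode
            .Descendants()
            // Keep only text nodes; comments and elements are skipped.
            .Where(n => n.NodeType == HtmlNodeType.Text
                        // Assumption: script and style bodies are not visible text.
                        && n.ParentNode.Name != "script"
                        && n.ParentNode.Name != "style")
            // Decode entities such as &amp; and drop surrounding whitespace.
            .Select(n => WebUtility.HtmlDecode(n.InnerText).Trim())
            .Where(t => t.Length > 0);

        // Formatting and paragraph breaks are deliberately discarded.
        return string.Join(" ", texts);
    }
}
```

    Call it with something like Console.WriteLine(HtmlTextExtractor.ExtractText(File.ReadAllText(path))), where path is whatever file you are processing (File needs using System.IO). Because every text node is joined with a space, a word split by inline markup (for example wor<b>ld</b>) will come out with a space in the middle. If that matters for your documents, calling InnerText on the body element instead may be good enough.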