c# html string html-parsing

Convert HTML to Plain Text while preserving P, BR, UL, OL?


While exporting HTML text to an Excel sheet, I'm trying to preserve basic formatting such as line breaks (<br>, <p>) and lists (<ol>, <ul>).

Example input:

<p>This is a test.</p>
<p>This is another<br>test.</p>

<ul>
    <li>10</li>
    <li>20</li>
    <li>30</li>
</ul>

<p>End.</p>

Example output:

This is a test.

This is another
test.

- 10
- 20
- 30

End.

The free utility HTMLAsText from the famous NirSoft guy seems to do just what I want. Unfortunately, it comes with no source code.

Even after examining approximately 20 similar questions here on Stack Overflow and searching Google for hours, the closest thing I could find is this Code Project article.

My question therefore is:

Is anyone aware of a class/library that could convert HTML to plain text while preserving basic formatting?

Update 2013-05-10

I ended up with a single function; see the full code over at Pastebin.


Solution

  • Can you not do this yourself by replacing:

    <br /> with Environment.NewLine
    </p> with Environment.NewLine + Environment.NewLine
    <li> with " - "

    Then just strip out the rest of the HTML with a regex? That would seem to produce your example output. Of course, someone may have a more elegant solution than that. =)
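
The replacements suggested in this answer could be sketched roughly as follows. This is a minimal illustration, not the asker's final Pastebin function: the class name HtmlText and the method name ToPlainText are made up for the example. It also makes a few assumptions beyond the answer. Source whitespace is collapsed first, each <li> starts its own line with "- " (no leading space, to match the question's example output), </ul> and </ol> end a block the same way </p> does, and entities are decoded with WebUtility.HtmlDecode. Being regex-based, it will likely misbehave on nested lists, <script>/<style> content, comments, and attribute values containing ">".

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper: name and structure are illustrative only.
public static class HtmlText
{
    public static string ToPlainText(string html)
    {
        if (string.IsNullOrEmpty(html)) return string.Empty;
        string nl = Environment.NewLine;

        // Collapse all source whitespace (including line breaks) to single spaces,
        // roughly as a browser would when rendering.
        string text = Regex.Replace(html, @"\s+", " ");

        // <br>, <br/> and <br /> become one line break.
        text = Regex.Replace(text, @"\s*<br\s*/?>\s*", nl, RegexOptions.IgnoreCase);

        // Every <li> starts a new line with a dash (ordered lists get dashes too).
        text = Regex.Replace(text, @"\s*<li\b[^>]*>\s*", nl + "- ", RegexOptions.IgnoreCase);

        // </p>, </ul> and </ol> end a block, so they are followed by a blank line.
        text = Regex.Replace(text, @"\s*</(p|ul|ol)\s*>\s*", nl + nl, RegexOptions.IgnoreCase);

        // Strip every remaining tag.
        text = Regex.Replace(text, @"<[^>]+>", string.Empty);

        // Decode entities such as &amp; and &nbsp;.
        text = WebUtility.HtmlDecode(text);

        // Never leave more than one blank line in a row.
        text = Regex.Replace(text, @"(\r?\n){3,}", nl + nl);

        return text.Trim();
    }
}
```

Calling HtmlText.ToPlainText on the example input from the question should give the example output shown there, with line breaks in the platform's Environment.NewLine form. For anything beyond simple markup like this, a real HTML parser is probably the safer choice.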