While exporting HTML text to an Excel sheet, I'm trying to preserve basic formatting such as HTML line breaks (<br>, <p>), lists (<ol>, <ul>), etc.
Example input:
<p>This is a test.</p>
<p>This is another<br>test.</p>
<ul>
<li>10</li>
<li>20</li>
<li>30</li>
</ul>
<p>End.</p>
Example output:
This is a test.
This is another
test.
- 10
- 20
- 30
End.
The free utility HTMLAsText from the famous NirSoft guy seems to do just what I want; unfortunately, it comes with no source code.
Even after examining the roughly 20 similar questions here on Stack Overflow and browsing Google for hours, the closest thing I could find is this Code Project article.
My question therefore is:
Is anyone aware of a class/library that can convert HTML to plain text while preserving basic formatting?
Update 2013-05-10
I ended up with a single function; see the full code over at Pastebin.
Can you not do this yourself by replacing:
<br /> with Environment.NewLine
</p> with Environment.NewLine + Environment.NewLine
<li> with " - ".
Then just strip out the rest of the HTML with a regex? That would seem to produce what you want your example output to be. Of course, someone may have a more elegant solution than that. =)
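A minimal sketch of that replace-then-strip idea, written in Python for brevity, with "\n" standing in for Environment.NewLine. The function name html_to_text is hypothetical. The sketch makes a few assumptions beyond the three replacements above: it collapses source whitespace first, ends each list item at </li>, and decodes entities such as &amp;. It is only meant for simple, well-formed fragments like the example input, not as a general HTML parser.

```python
import html
import re


def html_to_text(markup: str) -> str:
    """Convert a simple HTML fragment to plain text (hypothetical helper)."""
    # Assumption: line breaks in the HTML source carry no meaning,
    # so collapse every whitespace run to a single space first.
    text = re.sub(r"\s+", " ", markup)
    # <br>, <br/> and <br /> become one line break.
    text = re.sub(r"<br\s*/?>", "\n", text, flags=re.IGNORECASE)
    # A closing </p> ends the paragraph with a blank line.
    text = re.sub(r"</p\s*>", "\n\n", text, flags=re.IGNORECASE)
    # Each <li> starts with " - "; ending the line at </li> is an assumption.
    text = re.sub(r"<li\b[^>]*>", " - ", text, flags=re.IGNORECASE)
    text = re.sub(r"</li\s*>", "\n", text, flags=re.IGNORECASE)
    # Strip every remaining tag, then decode entities such as &amp;.
    text = re.sub(r"<[^>]+>", "", text)
    text = html.unescape(text)
    # Trim stray spaces on each line and cap runs of blank lines at one.
    text = "\n".join(line.strip() for line in text.split("\n"))
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


if __name__ == "__main__":
    sample = (
        "<p>This is a test.</p>\n"
        "<p>This is another<br>test.</p>\n"
        "<ul>\n<li>10</li>\n<li>20</li>\n<li>30</li>\n</ul>\n"
        "<p>End.</p>"
    )
    print(html_to_text(sample))
```

Because </p> maps to two line breaks, paragraphs end up separated by a blank line; replace "\n\n" with "\n" if you want the tighter layout from the question's example output. Nested lists, <pre> blocks and malformed markup are not handled, so for anything beyond simple fragments a real HTML parser is probably the safer route.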