I have tried using regular expressions for this task (and have been advised against it), so instead I tried the HTMLAgilityPack. However, the resulting text is very poor: HTML lists (<ol>, <li>) are completely lost and just result in clumped-together paragraphs.
In another question I saw that Lynx (compiled for Windows) was recommended as a good alternative, but I am having trouble getting it working. How would one use lynx.exe to convert HTML (stored in a .NET string) to a presentable plain-text string with line breaks etc.?
The only way I can think of is writing the HTML to a file, using .NET's System.Diagnostics.Process to call lynx.exe -dump, and reading the resulting file. This seems very clumsy.
Is there a better way of doing it? What would the exact lynx.exe command line be for such a task?
The Lynx implementation I am using is this one:
http://invisible-island.net/datafiles/release/lynx-cs-setup.exe
Edit: Made some progress. This is the command line I've been using:
lynx.exe -dump "d:\test.html" >d:\output.txt
It sort of works, but if I open the resulting file in Notepad it is all on one line, because Lynx uses only line feed (LF) characters for new lines, whereas Notepad needs carriage returns as well (CRLF) to render them properly.
Also, it inserts too many line feeds: after </li> and <br /> tags it emits two line feeds:
Hello, this is a normal line of text.
Next an ordered list:
1. The
2. Quick
3. Brown Fox
4. Jumped
I can work around this by replacing two consecutive LFs with a single LF, but I'm still after a C# wrapper for all this.
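The two clean-up steps described above (dropping the doubled line feeds and making the text readable in Notepad) could be sketched in C# roughly as follows. The class and method names are made up for illustration, and this version collapses any run of two or more LFs rather than only exact pairs, which also removes paragraph spacing; treat it as a starting point, not a finished implementation.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper; the names are illustrative and not part of Lynx or .NET.
public static class LynxText
{
    // Collapses runs of two or more LF characters into a single LF,
    // then converts every LF to CRLF so Notepad shows the line breaks.
    public static string Tidy(string lynxOutput)
    {
        // Normalize any CRLF already present so the steps below only see LF.
        string text = lynxOutput.Replace("\r\n", "\n");

        // Collapse each run of blank lines into one line break.
        text = Regex.Replace(text, "\n{2,}", "\n");

        // Convert to Windows line endings.
        return text.Replace("\n", "\r\n");
    }
}
```

Calling LynxText.Tidy on the captured Lynx output before saving or displaying it should give single-spaced text with Windows line endings.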
Edit 2 - My final solution based on Christian's answer:
Function ConvertHtmlToPlainText(ByVal HtmlString As String) As String

    '#### Define FileBuffer Path (WorkingRoot must end with a path separator)
    Dim HtmlBuffer As String = WorkingRoot & "HtmlBuffer.html"

    '#### Delete any old buffer files
    Try
        If File.Exists(HtmlBuffer) Then
            File.Delete(HtmlBuffer)
        End If
    Catch ex As Exception
        Return "Error: Deleting old buffer file: " & ex.Message
    End Try

    '#### Write the HTML to the buffer file
    Try
        File.WriteAllText(HtmlBuffer, HtmlString)
    Catch ex As Exception
        Return "Error: Writing new buffer file: " & ex.Message
    End Try

    '#### Check the file was written OK
    If Not File.Exists(HtmlBuffer) Then
        Return "Error: HTML Buffer file was not written successfully."
    End If

    '#### Read the buffer file with Lynx and capture plain text output
    Try
        Dim p = New Process()
        '#### Quote the buffer path so a WorkingRoot containing spaces still works
        p.StartInfo = New ProcessStartInfo(LynxPath, "-dump -width 1000 """ & HtmlBuffer & """")
        p.StartInfo.WorkingDirectory = WorkingRoot
        p.StartInfo.UseShellExecute = False
        p.StartInfo.RedirectStandardOutput = True
        p.StartInfo.RedirectStandardError = True
        p.StartInfo.WindowStyle = ProcessWindowStyle.Hidden
        p.StartInfo.CreateNoWindow = True
        p.Start()

        '#### Grab the text rendered by Lynx. Read before WaitForExit:
        '#### waiting first can hang once the redirected output pipe fills up
        Dim text As String = p.StandardOutput.ReadToEnd()
        p.WaitForExit()

        '#### Collapse the doubled line feeds Lynx emits
        Return text.Replace(vbLf & vbLf, vbLf)
    Catch ex As Exception
        Return "Error: Error running LYNX to parse the buffer: " & ex.Message
    End Try

End Function
Using this you can invoke Lynx and read its output from the redirected StandardOutput into a string, without writing the output to a file first.
using System;
using System.Diagnostics;

namespace Lynx.Dumper
{
    public class Dampler
    {
        public void DumpUrl()
        {
            var url = "http://www.google.com";
            var p = new Process();
            p.StartInfo = new ProcessStartInfo("c:/tools/lynx_w32/lynx.exe", "-dump -nolist " + url)
            {
                WorkingDirectory = "c:/tools/lynx_w32/",
                UseShellExecute = false,
                RedirectStandardOutput = true,
                RedirectStandardError = true,
                WindowStyle = ProcessWindowStyle.Hidden,
                CreateNoWindow = true
            };
            p.Start();

            // Grab the text rendered by Lynx. Read before waiting for exit:
            // waiting first can hang once the redirected output pipe fills up.
            var text = p.StandardOutput.ReadToEnd();
            p.WaitForExit();

            Console.WriteLine(text);
        }
    }
}
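For the original case of HTML held in a string rather than at a URL, the same approach can be combined with a temporary file. The following is a minimal sketch under some assumptions: LynxRunner, BuildArguments and HtmlToText are hypothetical names, lynxPath is whatever the full path to lynx.exe is on your machine, and the temporary file is given an .html extension on the assumption that Lynx uses the extension to decide the file is HTML. Character encoding of the written file is not handled here.

```csharp
using System;
using System.Diagnostics;
using System.IO;

// Hypothetical wrapper; names are illustrative only.
public static class LynxRunner
{
    // Builds the Lynx argument string; the path is quoted so spaces survive.
    public static string BuildArguments(string htmlPath)
    {
        return "-dump -nolist \"" + htmlPath + "\"";
    }

    // Writes the HTML to a temporary file, runs Lynx on it and returns the dump.
    // lynxPath is an assumption: the full path to lynx.exe on your machine.
    public static string HtmlToText(string html, string lynxPath)
    {
        // Unique name with an .html extension (assumed to make Lynx parse it as HTML).
        string htmlPath = Path.Combine(Path.GetTempPath(), Guid.NewGuid().ToString("N") + ".html");
        File.WriteAllText(htmlPath, html);
        try
        {
            var startInfo = new ProcessStartInfo(lynxPath, BuildArguments(htmlPath))
            {
                WorkingDirectory = Path.GetDirectoryName(lynxPath) ?? "",
                UseShellExecute = false,
                RedirectStandardOutput = true,
                CreateNoWindow = true
            };
            using (var p = Process.Start(startInfo))
            {
                // Read first, then wait, so a full output pipe cannot stall Lynx.
                string text = p.StandardOutput.ReadToEnd();
                p.WaitForExit();
                return text;
            }
        }
        finally
        {
            // Always remove the temporary file, even if Lynx failed to start.
            File.Delete(htmlPath);
        }
    }
}
```

A call might look like LynxRunner.HtmlToText(htmlString, @"c:\tools\lynx_w32\lynx.exe"). The temporary file is still there behind the scenes, but it is created and cleaned up in one place, and the caller only deals with strings.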