I'm wondering if it's possible to extract formatted text from an HTMLDocument using AngleSharp. I'm using the following code to extract the text. The problem is that the extracted text runs together: there is no break between the text of adjacent elements.
var parser = new HtmlParser();
var document = parser.Parse("<script>var x = 1;</script> <h1>Some example source</h1><p>This is a paragraph element</p>");
var text = document.Body.Text();
This returns the following text:
Some example sourceThis is a paragraph element
Ideally, I would like it to return "Some example source This is a paragraph element", with some separation between the text values of the individual nodes.
I know I am late to the party, but better late than never (also I hope someone else benefits from this answer).
The comments on the question are both right. On the one hand, we have the W3C specification and the document's source, which tell us that there won't be any space in the (official) serialization. On the other hand, it is quite a common requirement to "integrate" some spaces where applicable (or maybe even newlines, e.g., if a <br> element is seen).
That said, the library does not know your specific use case (i.e., when you want to insert spaces). However, it can help you get to your desired state more easily.
Serialization from the DOM to a string is done via an instance of a class that implements IMarkupFormatter. The ToHtml() method of any DOM node accepts such an object and returns a string, so we can write:
var myFormatter = new MyMarkupFormatter();
var text = document.Body.ToHtml(myFormatter);
Now the question is reduced to an implementation of MyMarkupFormatter that works for us. This formatter essentially yields only the text nodes, except that certain tags are treated differently (i.e., they return some text such as spaces or newlines).
using System;
using AngleSharp;
using AngleSharp.Dom;

// Emits only text content; selected opening tags are replaced by whitespace.
public class MyMarkupFormatter : IMarkupFormatter
{
    // Comments, doctypes and processing instructions carry no visible text.
    String IMarkupFormatter.Comment(IComment comment)
    {
        return String.Empty;
    }

    String IMarkupFormatter.Doctype(IDocumentType doctype)
    {
        return String.Empty;
    }

    String IMarkupFormatter.Processing(IProcessingInstruction processing)
    {
        return String.Empty;
    }

    // Text nodes are passed through unchanged.
    String IMarkupFormatter.Text(ICharacterData text)
    {
        return text.Data;
    }

    // Opening tags produce a separator instead of markup.
    String IMarkupFormatter.OpenTag(IElement element, Boolean selfClosing)
    {
        switch (element.LocalName)
        {
            case "p":
                return "\n\n";
            case "br":
                return "\n";
            case "span":
                return " ";
        }

        // All other elements contribute nothing but their text content.
        return String.Empty;
    }

    String IMarkupFormatter.CloseTag(IElement element, Boolean selfClosing)
    {
        return String.Empty;
    }

    String IMarkupFormatter.Attribute(IAttr attr)
    {
        return String.Empty;
    }
}
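Since the formatter emits its separators unconditionally, the result can start with blank lines or contain doubled spaces (e.g., a span that already follows a space). If that matters for your use case, a small clean-up pass over the returned string may help. The following is only a minimal sketch: TextCleanup and Normalize are hypothetical names that are not part of AngleSharp, and it assumes "\n" line breaks as emitted by the formatter above.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of AngleSharp) to tidy the formatter's output.
public static class TextCleanup
{
    public static string Normalize(string raw)
    {
        // Collapse runs of spaces and tabs into a single space.
        var text = Regex.Replace(raw, @"[ \t]+", " ");

        // Drop a space that sits directly before or after a line break.
        text = Regex.Replace(text, @" ?\n ?", "\n");

        // Allow at most one blank line between paragraphs.
        text = Regex.Replace(text, @"\n{3,}", "\n\n");

        // Remove leading and trailing whitespace, including line breaks.
        return text.Trim();
    }
}
```

It would be applied to the formatter's result, e.g., TextCleanup.Normalize(document.Body.ToHtml(myFormatter)). Which whitespace to keep is a matter of taste, so adjust the patterns as needed.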
If stripping all non-text information is not what you need, AngleSharp also offers the PrettyMarkupFormatter out of the box. Maybe this is already quite close to what you wanted (a "prettier" markup formatter).
Hope this helps!