Mailkit: Converting HtmlBody to pdf using iTextSharp XMLWorker throws "The document has no pages"


I'm trying to convert the HtmlBody of e-mails that I fetch from a mail server with MailKit to PDF, but it looks like iTextSharp doesn't really like the HTML I'm passing it.

My method works well with the "sample" HTML in the code below, but with the real e-mail body I get a "The document has no pages" error, which seems to be thrown when the input is not recognized as HTML.

public void GenerateHtmlFromBody(UniqueId uid)
{
    var email = imap.Inbox.GetMessage(uid);
    Byte[] bytes;

    using (var ms = new MemoryStream())
    {
        using (var doc = new Document())
        {
            using (var writer = PdfWriter.GetInstance(doc, ms))
            {
                doc.Open();

                //Sample HTML and CSS
                var example_html = @"<p>This <em>is </em><span class=""headline"" style=""text-decoration: underline;"">some</span> <strong>sample <em> text</em></strong><span style=""color: red;"">!!!</span></p>";
                var example_css = @".headline{font-size:200%}";

                using (var srHtml = new StringReader(email.HtmlBody))
                {
                    //Parse the HTML
                    iTextSharp.tool.xml.XMLWorkerHelper.GetInstance().ParseXHtml(writer, doc, srHtml);
                }
                doc.Close();
            }
        }
        bytes = ms.ToArray();
    }
    var testFile = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "processedMailPdf.pdf");
    System.IO.File.WriteAllBytes(testFile, bytes);
}

I'm accessing MimeMessage.HtmlBody, and in the debugger it does, in fact, look like HTML.

Here is a pastebin link with the HtmlBody of the MimeMessage, because I hit the character limit here.

What am I missing? Thanks.

EDIT: I've tried the HTMLWorker (which is deprecated), and it's not stable: it worked with one e-mail but not with others. It's not a real solution, but it did finally generate a PDF from MailKit output, which was "something".


Solution

  • Looks like you're facing two issues with HtmlBody:

    1. It may be plain text.
    2. When it is [X]HTML, it is not well-formed.

    Whenever a string might not be well-formed XML, your best bet is to use a parser like HtmlAgilityPack to clean up the mess. Here's a simple helper method that uses XPath to cover both issues above. It was UPDATED based on comments to also remove HtmlCommentNodes, which break iText XML Worker:

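    // Requires the HtmlAgilityPack package: using HtmlAgilityPack;
    string FixBrokenMarkup(string broken)
    {
        HtmlDocument h = new HtmlDocument()
        {
            OptionAutoCloseOnEnd = true,   // close unclosed nodes at the end of the document
            OptionFixNestedTags = true,    // partially fix nesting errors (LI, TR, TH, TD)
            OptionWriteEmptyNodes = true   // write empty nodes as closed, e.g. <br />
        };
        h.LoadHtml(broken);

        // UPDATED: remove comment nodes, which break iText XML Worker
        var comments = h.DocumentNode.SelectNodes("//comment()");
        if (comments != null)
        {
            foreach (var node in comments) { node.Remove(); }
        }

        // "child::*" selects the element children of the document node.
        // SelectNodes returns null when there are none, i.e. the input is plain text.
        return h.DocumentNode.SelectNodes("child::*") != null
            ? h.DocumentNode.WriteTo()
            // Plain text: encode <, > and & so the <span> wrapper stays well-formed.
            : string.Format("<span>{0}</span>", System.Net.WebUtility.HtmlEncode(broken));
    }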
    

    And for completeness, code to generate the PDF. Tested and working with the pastebin link you provided above:

    var fixedMarkup = FixBrokenMarkup(PASTEBIN);
    // swap initialization to verify plain-text works too
    // var fixedMarkup = FixBrokenMarkup("some text");
    
    using (var stream = new MemoryStream())
    {
        using (var document = new Document())
        {
            PdfWriter writer = PdfWriter.GetInstance(document, stream);
            document.Open();
            using (var stringReader = new StringReader(fixedMarkup))
            {
                XMLWorkerHelper.GetInstance().ParseXHtml(
                    writer, document, stringReader
                );
            }
        }
        File.WriteAllBytes(OUTPUT, stream.ToArray());
    }
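
    To tie this back to the MailKit side of the question: MimeMessage.HtmlBody is null when a message has no HTML part, and new StringReader(null) throws an ArgumentNullException, so it may help to pick the body before calling FixBrokenMarkup. The following is only a minimal sketch. BodySelection and PickBody are hypothetical names that are not part of MailKit or iTextSharp, and the fallback order (HTML first, then plain text) is an assumption about what you want:

    ```csharp
    using System;

    // Hypothetical helper, not part of MailKit: chooses which body text to convert.
    static class BodySelection
    {
        // Prefers the HTML body, falls back to the plain-text body,
        // and never returns null so later parsing code gets a real string.
        public static string PickBody(string htmlBody, string textBody)
        {
            if (!string.IsNullOrWhiteSpace(htmlBody)) return htmlBody;
            if (!string.IsNullOrWhiteSpace(textBody)) return textBody;
            return string.Empty;
        }
    }
    ```

    In the question's method this would be called as FixBrokenMarkup(BodySelection.PickBody(email.HtmlBody, email.TextBody)). An empty result will most likely still end in "The document has no pages", since nothing gets added to the document, so you may prefer to skip such messages instead of converting them.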