
How can I use iText to convert HTML with images and hyperlinks to PDF?


I'm trying to convert HTML to PDF using iTextSharp in an ASP.NET web application that uses both MVC and Web Forms. The <img> and <a> elements have absolute and relative URLs, and some of the <img> elements are base64-encoded. Typical answers here at SO and Google search results use generic HTML to PDF code with XMLWorkerHelper that looks something like this:

using (var stringReader = new StringReader(xHtml))
{
    using (Document document = new Document())
    {
        PdfWriter writer = PdfWriter.GetInstance(document, stream);
        document.Open();
        XMLWorkerHelper.GetInstance().ParseXHtml(
            writer, document, stringReader
        );
    }
}

So with sample HTML like this:

<div>
    <h3>HTML Works, but Broken in Converted PDF</h3>
    <div>Relative local <img>: <img src='./../content/images/kuujinbo_320-30.gif' /></div>
    <div>
        Base64 <img>:
        <img src='' />
    </div>
    <div><a href='/somePage.html'>Relative local hyperlink, broken in PDF</a></div>
</div>

The resulting PDF (1) is missing all images, and (2) has broken hyperlinks wherever the URL is relative: they use a file URI scheme (file:///XXX...) instead of pointing to the correct web site.

Some answers here at SO and others from Google search recommend replacing relative URLs with absolute URLs, which is perfectly acceptable for one-off cases. However, globally replacing all <img src> and <a href> attributes with a hard-coded string is unacceptable for this question, so please do not post an answer like that, because it will accordingly be downvoted.

I'm looking for a solution that works for many different web applications residing in test, development, and production environments.


Solution

  • Out of the box XMLWorker only understands absolute URIs, so the described issues are expected behavior. The parser can't automagically deduce URI schemes or paths without some additional information.

    Implementing an ILinkProvider fixes the broken hyperlink problem, and implementing an IImageProvider fixes the broken image problem. Since both implementations must perform URI resolution, that's the first step. The following helper class does that, and also tries to make web (ASP.NET) context calls (examples follow) as simple as possible:

    // resolve URIs for LinkProvider & ImageProvider
    public class UriHelper
    {
        /* IsLocal; when running in web context:
         * [1] give LinkProvider http[s] scheme; see CreateBase(string baseUri)
         * [2] give ImageProvider relative path starting with '/' - see:
         *     Combine(string relativeUri)
         */
        public bool IsLocal { get; set; }
        public HttpContext HttpContext { get; private set; }
        public Uri BaseUri { get; private set; }
    
        public UriHelper(string baseUri) : this(baseUri, true) {}
        public UriHelper(string baseUri, bool isLocal)
        {
            IsLocal = isLocal;
            HttpContext = HttpContext.Current;
            BaseUri = CreateBase(baseUri);
        }
    
        /* get URI for IImageProvider to instantiate iTextSharp.text.Image for 
         * each <img> element in the HTML.
         */
        public string Combine(string relativeUri)
        {
            /* when running in a web context, the HTML is coming from a MVC view 
             * or web form, so convert the incoming URI to a **local** path
             */
            if (HttpContext != null && !BaseUri.IsAbsoluteUri && IsLocal)
            {
                return HttpContext.Server.MapPath(
                    // Combine() checks directory traversal exploits
                    VirtualPathUtility.Combine(BaseUri.ToString(), relativeUri)
                );
            }
            return BaseUri.Scheme == Uri.UriSchemeFile 
                ? Path.Combine(BaseUri.LocalPath, relativeUri)
                // for this example we're assuming URI.Scheme is http[s]
                : new Uri(BaseUri, relativeUri).AbsoluteUri;
        }
    
        private Uri CreateBase(string baseUri)
        {
            if (HttpContext != null)
            {   // running on a web server; need to update original value  
                var req = HttpContext.Request;
                baseUri = IsLocal
                    // IImageProvider; absolute virtual path (starts with '/')
                    // used to convert to local file system path. see:
                    // Combine(string relativeUri)
                    ? req.ApplicationPath
                    // ILinkProvider; absolute http[s] URI scheme
                    : req.Url.GetLeftPart(UriPartial.Authority)
                        + HttpContext.Request.ApplicationPath;
            }
    
            Uri uri;
            if (Uri.TryCreate(baseUri, UriKind.RelativeOrAbsolute, out uri)) return uri;
    
            throw new InvalidOperationException("cannot create a valid BaseUri");
        }
    }
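
    How the base URI is written matters as much as having one. Here's a minimal, framework-only sketch (no iTextSharp or ASP.NET involved; the UriResolutionDemo class and the example.com URLs are made up for illustration) of how System.Uri resolves the kinds of relative values that end up in new Uri(BaseUri, relativeUri):

    ```csharp
    using System;

    // Hypothetical demo class, not part of the answer's code.
    public static class UriResolutionDemo
    {
        // Same constructor call UriHelper.Combine() falls back to for http[s] bases.
        public static string Resolve(string baseUri, string relativeUri)
        {
            return new Uri(new Uri(baseUri), relativeUri).AbsoluteUri;
        }

        public static void Main()
        {
            // base ends with '/': the last path segment is kept
            // prints https://example.com/app/wiki/IText
            Console.WriteLine(Resolve("https://example.com/app/", "wiki/IText"));

            // base without trailing '/': the last segment ("app") is replaced
            // prints https://example.com/wiki/IText
            Console.WriteLine(Resolve("https://example.com/app", "wiki/IText"));

            // a leading '/' resolves against the authority, ignoring the base path
            // prints https://example.com/somePage.html
            Console.WriteLine(Resolve("https://example.com/app/", "/somePage.html"));

            // an absolute URI ignores the base entirely
            // prints https://en.wikipedia.org/
            Console.WriteLine(Resolve("https://example.com/app/", "https://en.wikipedia.org/"));
        }
    }
    ```

    The second case is the one to watch: when the base names a directory or application root, it's worth checking that it ends with '/' before resolving against it.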
    

    Implementing ILinkProvider is pretty simple now that UriHelper gives the base URI. We just need the correct URI scheme (file or http[s]):

    // make hyperlinks with relative URLs absolute
    public class LinkProvider : ILinkProvider
    {
        // rfc1738 - file URI scheme section 3.10
        public const char SEPARATOR = '/';
        public string BaseUrl { get; private set; }
    
        public LinkProvider(UriHelper uriHelper)
        {
            var uri = uriHelper.BaseUri;
            /* simplified implementation that only takes into account:
             * Uri.UriSchemeFile || Uri.UriSchemeHttp || Uri.UriSchemeHttps
             */
            BaseUrl = uri.Scheme == Uri.UriSchemeFile
                // need trailing separator or file paths break
                ? uri.AbsoluteUri.TrimEnd(SEPARATOR) + SEPARATOR
                // assumes Uri.UriSchemeHttp || Uri.UriSchemeHttps
                : uri.AbsoluteUri;
        }
    
        public string GetLinkRoot()
        {
            return BaseUrl;
        }
    }
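
    The trailing separator handling in the constructor can be tried in isolation. This is only a sketch of the string normalization: LinkRootDemo and the file path are invented for the example, and it makes no claim about how XMLWorker joins the root with each href internally.

    ```csharp
    using System;

    // Hypothetical demo class, not part of the answer's code.
    public static class LinkRootDemo
    {
        public const char Separator = '/';

        // Returns the root with exactly one trailing separator,
        // however many it had before (zero, one, or several).
        public static string WithTrailingSeparator(string root)
        {
            return root.TrimEnd(Separator) + Separator;
        }

        public static void Main()
        {
            // prints file:///C:/temp/html/ for both inputs
            Console.WriteLine(WithTrailingSeparator("file:///C:/temp/html"));
            Console.WriteLine(WithTrailingSeparator("file:///C:/temp/html//"));
        }
    }
    ```

    Trimming first and then appending makes the call idempotent, so it doesn't matter whether the caller's base path already ended with a separator.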
    

    IImageProvider only requires implementing a single method, Retrieve(string src), but Store(string src, Image img) is easy to add as well. Note the inline comments there and for GetImageRootPath():

    // handle <img> elements in HTML  
    public class ImageProvider : IImageProvider
    {
        private UriHelper _uriHelper;
        // see Store(string src, Image img)
        private Dictionary<string, Image> _imageCache = 
            new Dictionary<string, Image>();
    
        public virtual float ScalePercent { get; set; }
        public virtual Regex Base64 { get; set; }
    
        public ImageProvider(UriHelper uriHelper) : this(uriHelper, 67f) { }
        //              hard-coded based on general past experience ^^^
        // but call the overload to supply your own
        public ImageProvider(UriHelper uriHelper, float scalePercent)
        {
            _uriHelper = uriHelper;
            ScalePercent = scalePercent;
            Base64 = new Regex( // rfc2045, section 6.8 (alphabet/padding)
                @"^data:image/[^;]+;base64,(?<data>[a-z0-9+/]+={0,2})$",
                RegexOptions.Compiled | RegexOptions.IgnoreCase
            );
        }
    
        public virtual Image ScaleImage(Image img)
        {
            img.ScalePercent(ScalePercent);
            return img;
        }
    
        public virtual Image Retrieve(string src)
        {
            if (_imageCache.ContainsKey(src)) return _imageCache[src];
    
            try
            {
                if (Regex.IsMatch(src, "^https?://", RegexOptions.IgnoreCase))
                {
                    return ScaleImage(Image.GetInstance(src));
                }
    
                Match match;
                if ((match = Base64.Match(src)).Length > 0)
                {
                    return ScaleImage(Image.GetInstance(
                        Convert.FromBase64String(match.Groups["data"].Value)
                    ));
                }
    
                var imgPath = _uriHelper.Combine(src);
                return ScaleImage(Image.GetInstance(imgPath));
            }
            // not implemented to keep the SO answer (relatively) short
            catch (BadElementException ex) { return null; }
            catch (IOException ex) { return null; }
            catch (Exception ex) { return null; }
        }
    
        /*
         * always called after Retrieve(string src):
         * [1] cache any duplicate <img> in the HTML source so the image bytes
         *     are only written to the PDF **once**, which reduces the 
         *     resulting file size.
         * [2] the cache can also **potentially** save network IO if you're
         *     running the parser in a loop, since Image.GetInstance() creates
         *     a WebRequest when an image resides on a remote server. couldn't
         *     find a CachePolicy in the source code
         */
        public virtual void Store(string src, Image img)
        {
            if (!_imageCache.ContainsKey(src)) _imageCache.Add(src, img);
        }
    
        /* XMLWorker documentation for ImageProvider recommends implementing
         * GetImageRootPath():
         * 
         * http://demo.itextsupport.com/xmlworker/itextdoc/flatsite.html#itextdoc-menu-10
         * 
         * but a quick run through the debugger never hits the breakpoint, so 
         * not sure if I'm missing something, or something has changed internally 
         * with XMLWorker....
         */
        public virtual string GetImageRootPath() { return null; }
        public virtual void Reset() { }
    }
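
    To see which branch of Retrieve(string src) a given src value takes, the dispatch order can be reproduced without iTextSharp. A rough sketch follows, using the same two regular expressions. ImageSrcDemo, the ImageSource enum, and the example.com URL are hypothetical, and the base64 payload is just the ASCII text "hello", not a real image.

    ```csharp
    using System;
    using System.Text;
    using System.Text.RegularExpressions;

    // Hypothetical names, introduced only for this illustration.
    public enum ImageSource { Remote, Base64, Local }

    public static class ImageSrcDemo
    {
        // Same pattern as ImageProvider.Base64 above.
        private static readonly Regex Base64 = new Regex(
            @"^data:image/[^;]+;base64,(?<data>[a-z0-9+/]+={0,2})$",
            RegexOptions.Compiled | RegexOptions.IgnoreCase
        );

        // Mirrors the order of checks in Retrieve(): http[s], then data URI,
        // then everything else is treated as a path to resolve.
        public static ImageSource Classify(string src)
        {
            if (Regex.IsMatch(src, "^https?://", RegexOptions.IgnoreCase))
                return ImageSource.Remote;
            if (Base64.IsMatch(src)) return ImageSource.Base64;
            return ImageSource.Local;
        }

        // Returns the decoded bytes, or null when src is not a base64 data URI.
        public static byte[] DecodeDataUri(string src)
        {
            var match = Base64.Match(src);
            return match.Success
                ? Convert.FromBase64String(match.Groups["data"].Value)
                : null;
        }

        public static void Main()
        {
            var dataUri = "data:image/png;base64,aGVsbG8=";
            Console.WriteLine(Classify("https://example.com/logo.png")); // Remote
            Console.WriteLine(Classify(dataUri));                        // Base64
            Console.WriteLine(Classify("./../content/images/kuujinbo_320-30.gif")); // Local
            Console.WriteLine(Encoding.ASCII.GetString(DecodeDataUri(dataUri)));    // hello
        }
    }
    ```

    One thing this makes visible: the character class has no whitespace in it, so a data URI whose payload is wrapped across lines won't match and falls through to the local path branch.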
    

    Based on the XML Worker documentation it's pretty straightforward to hook the implementations of ILinkProvider and IImageProvider above into a simple parser class:

    /* a simple parser that uses XMLWorker and XMLParser to handle converting 
     * (most) images and hyperlinks internally
     */
    public class SimpleParser
    {
        public virtual ILinkProvider LinkProvider { get; set; }
        public virtual IImageProvider ImageProvider { get; set; }
    
        public virtual HtmlPipelineContext HtmlPipelineContext { get; set; }
        public virtual ITagProcessorFactory TagProcessorFactory { get; set; }
        public virtual ICSSResolver CssResolver { get; set; }
    
        /* overloads simplified to keep SO answer (relatively) short. if needed
         * set LinkProvider/ImageProvider after instantiating SimpleParser()
         * to override the defaults (e.g. ImageProvider.ScalePercent)
         */
        public SimpleParser() : this(null) { }
        public SimpleParser(string baseUri)
        {
            LinkProvider = new LinkProvider(new UriHelper(baseUri, false));
            ImageProvider = new ImageProvider(new UriHelper(baseUri, true));
    
            HtmlPipelineContext = new HtmlPipelineContext(null);
    
            // another story altogether, and not implemented for simplicity 
            TagProcessorFactory = Tags.GetHtmlTagProcessorFactory();
            CssResolver = XMLWorkerHelper.GetInstance().GetDefaultCssResolver(true);
        }
    
        /*
         * when sending XHR via any of the popular JavaScript frameworks,
         * <img> tags are **NOT** always closed, which results in the 
         * infamous iTextSharp.tool.xml.exceptions.RuntimeWorkerException:
         * 'Invalid nested tag a found, expected closing tag img.' this method
         * is a simple workaround.
         */
        public virtual string SimpleAjaxImgFix(string xHtml)
        {
            return Regex.Replace(
                xHtml,
                "(?<image><img[^>]+)(?<=[^/])>",
                new MatchEvaluator(match => match.Groups["image"].Value + " />"),
                RegexOptions.IgnoreCase | RegexOptions.Multiline
            );
        }
    
        public virtual void Parse(Stream stream, string xHtml)
        {
            xHtml = SimpleAjaxImgFix(xHtml);
    
            using (var stringReader = new StringReader(xHtml))
            {
                using (Document document = new Document())
                {
                    PdfWriter writer = PdfWriter.GetInstance(document, stream);
                    document.Open();
    
                    HtmlPipelineContext
                        .SetTagFactory(TagProcessorFactory)
                        .SetLinkProvider(LinkProvider)
                        .SetImageProvider(ImageProvider)
                    ;
                    var pdfWriterPipeline = new PdfWriterPipeline(document, writer);
                    var htmlPipeline = new HtmlPipeline(HtmlPipelineContext, pdfWriterPipeline);
                    var cssResolverPipeline = new CssResolverPipeline(CssResolver, htmlPipeline);
    
                    XMLWorker worker = new XMLWorker(cssResolverPipeline, true);
                    XMLParser parser = new XMLParser(worker);
                    parser.Parse(stringReader);
                }
            }
        }
    }
    

    As commented inline, SimpleAjaxImgFix(string xHtml) specifically handles XHR that may send unclosed <img> tags, which is valid HTML, but invalid XML that will break XMLWorker. A simple explanation and implementation of how to receive a PDF or other binary data with XHR and iTextSharp can be found here.
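
    For anyone who wants to see exactly what the regular expression in SimpleAjaxImgFix(string xHtml) does and doesn't touch, here's a standalone sketch with the same pattern. The ImgFixDemo class and the sample markup are made up for the example.

    ```csharp
    using System;
    using System.Text.RegularExpressions;

    // Hypothetical demo class, not part of the answer's code.
    public static class ImgFixDemo
    {
        // Same pattern as SimpleAjaxImgFix(): match an <img ...> whose last
        // character before '>' is not '/', then re-emit it self-closed.
        public static string CloseImgTags(string html)
        {
            return Regex.Replace(
                html,
                "(?<image><img[^>]+)(?<=[^/])>",
                match => match.Groups["image"].Value + " />",
                RegexOptions.IgnoreCase | RegexOptions.Multiline
            );
        }

        public static void Main()
        {
            // unclosed <img> is closed, but <br> is left alone
            // prints <div><img src='a.gif' /><br></div>
            Console.WriteLine(CloseImgTags("<div><img src='a.gif'><br></div>"));

            // an already self-closed <img> is not modified
            // prints <img src='a.gif' />
            Console.WriteLine(CloseImgTags("<img src='a.gif' />"));
        }
    }
    ```

    The first output still contains an unclosed <br>, which is still invalid XML; the pattern only ever targets <img>.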

    A Regex was used in SimpleAjaxImgFix(string xHtml) so that anyone using (copy/paste?) the code doesn't need to add another NuGet package, but an HTML parser like HtmlAgilityPack should be used, since it turns this:

    <div><img src='a.gif'><br><hr></div>
    

    into this:

    <div><img src='a.gif' /><br /><hr /></div>
    

    with only a few lines of code:

    var hDocument = new HtmlDocument()
    {
        OptionWriteEmptyNodes = true,
        OptionAutoCloseOnEnd = true
    };
    hDocument.LoadHtml("<div><img src='a.gif'><br><hr></div>");
    var closedTags = hDocument.DocumentNode.WriteTo();
    

    Also of note - use SimpleParser.Parse() above as a general blueprint to additionally implement a custom ICSSResolver or ITagProcessorFactory, which is explained in the documentation.

    Now the issues described in the question should be taken care of. Called from an MVC action method:

    [HttpPost]  // some browsers have URL length limits
    [ValidateInput(false)] // or throws HttpRequestValidationException
    public ActionResult Index(string xHtml)
    {
        Response.ContentType = "application/pdf";
        Response.AppendHeader(
            "Content-Disposition", "attachment; filename=test.pdf"
        );
        var simpleParser = new SimpleParser();
        simpleParser.Parse(Response.OutputStream, xHtml);
    
        return new EmptyResult();
    }
    

    or from a Web Form that gets HTML from a server control:

    Response.ContentType = "application/pdf";
    Response.AppendHeader("Content-Disposition", "attachment; filename=test.pdf");
    using (var stringWriter = new StringWriter())
    {
        using (var htmlWriter = new HtmlTextWriter(stringWriter))
        {
            ConvertControlToPdf.RenderControl(htmlWriter);
        }
        var simpleParser = new SimpleParser();
        simpleParser.Parse(Response.OutputStream, stringWriter.ToString());
    }
    Response.End();
    

    or a simple HTML file with hyperlinks and images on the file system:

    <h1>HTML Page 00 on Local File System</h1>
    <div>
        <div>
            Relative &lt;img&gt;: <img src='Images/alt-gravatar.png' />
        </div>
        <div>
            Hyperlink to file system HTML page: 
            <a href='file-system-html-01.html'>Page 01</a>
        </div>
    </div>
    

    or HTML from a remote web site:

    <div>
        <div>
            <img width="200" alt="Wikipedia Logo"
                 src="portal/wikipedia.org/assets/img/Wikipedia-logo-v2.png">
        </div>
        <div lang="en">
            <a href="https://en.wikipedia.org/">English</a>
        </div>
        <div lang="en">
            <a href="wiki/IText">iText</a>
        </div>
    </div>
    

    The two HTML snippets above, run from a console app:

    var filePaths = Path.Combine(basePath, "file-system-html-00.html");
    var htmlFile = File.ReadAllText(filePaths);
    var remoteUrl = Path.Combine(basePath, "wikipedia.html");
    var htmlRemote = File.ReadAllText(remoteUrl);
    var outputFile = Path.Combine(basePath, "filePaths.pdf");
    var outputRemote = Path.Combine(basePath, "remoteUrl.pdf");
    
    using (var stream = new FileStream(outputFile, FileMode.Create))
    {
        var simpleParser = new SimpleParser(basePath);
        simpleParser.Parse(stream, htmlFile);
    }
    using (var stream = new FileStream(outputRemote, FileMode.Create))
    {
        var simpleParser = new SimpleParser("https://wikipedia.org");
        simpleParser.Parse(stream, htmlRemote);
    }
    

    Quite a long answer, but a look at the questions here at SO tagged html, pdf, and itextsharp shows that, as of this writing (2016-02-23), there are 776 results against 4,063 total tagged itextsharp. That's 19%.