java, html, pdf, itextpdf

Problems with HTML content in generated PDF


I am generating a PDF from HTML, but instead of the markup being rendered as normal text, my PDF pages are filled with literal HTML tags such as <p> and <li>.


Solution

  • You'll need to remove all tags and unescape the special characters (HTML entities).

    PHP example:

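    // preg_replace takes (pattern, replacement, subject); the pattern needs delimiters
    $text = preg_replace('/<[^>]*>/', '', $html);
    // turn entities such as &quot; back into characters
    $text = html_entity_decode($text);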
    

    VB.NET example:

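    ' Regex lives in System.Text.RegularExpressions
    Dim text As String = Regex.Replace(html, "<[^>]*>", "")
    ' WebUtility is in System.Net (System.Web provides HttpUtility.HtmlDecode instead)
    text = System.Net.WebUtility.HtmlDecode(text)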
    

    Java example:

    // html is the String holding the markup; replaceAll treats its first argument as a regex
    String text = html.replaceAll("<[^>]*>", "");
    

    For decoding the HTML entities, you'll find a good answer in "Java: How to unescape HTML character entities in Java?". Otherwise, you could simply replace them yourself if you know every entity that can occur (&nbsp;, &quot;, ...).
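
As a rough illustration of the "just replace them" route in Java, here is a minimal sketch. The class name HtmlText, the method toPlainText, and the short entity table are all made up for this example; the table only covers a handful of entities and would need extending for real input. Note that &nbsp; is mapped to a plain space here for simplicity, and that &amp; is replaced last so that already-escaped text is not decoded twice.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: strips tags, then decodes a small, hand-picked set of entities.
final class HtmlText {

    // LinkedHashMap keeps insertion order, which matters for the &amp; entry below.
    private static final Map<String, String> ENTITIES = new LinkedHashMap<>();
    static {
        ENTITIES.put("&nbsp;", " ");   // simplification: plain space, not U+00A0
        ENTITIES.put("&quot;", "\"");
        ENTITIES.put("&lt;", "<");
        ENTITIES.put("&gt;", ">");
        ENTITIES.put("&#39;", "'");
        ENTITIES.put("&amp;", "&");    // last, so "&amp;lt;" becomes "&lt;" rather than "<"
    }

    static String toPlainText(String html) {
        // Remove anything that looks like a tag, as in the one-liner above.
        String text = html.replaceAll("<[^>]*>", "");
        // Literal (non-regex) replacement of each known entity, in table order.
        for (Map.Entry<String, String> e : ENTITIES.entrySet()) {
            text = text.replace(e.getKey(), e.getValue());
        }
        return text;
    }
}
```

With this sketch, HtmlText.toPlainText("<p>Tom &amp; Jerry</p>") returns "Tom & Jerry". Entities that are not in the table are left untouched, so for arbitrary HTML a full entity decoder (as discussed in the linked answer) is the safer choice.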