c# .net itext pdf-generation pdf-writer

How can I define the chunk length in a PDF generated from HTML?


I'm generating a PDF file from an e-mail that I retrieve through MailKit.

There is no problem generating the PDF file itself (I pass the PdfWriter instance clean, ready-to-use HTML produced with the HtmlAgilityPack).

I just want each word to be a single TextChunk instead of each whole phrase, which is what is currently being written. I assume this is configurable, since the TextChunk composition varies depending on the PDF printer/generator that produced a document: sometimes the chunks are phrases, sometimes words, sometimes even single characters.

Is there any way to make each new chunk inserted into the document a single word?

This is my code, but so far I haven't figured out how to specify that level of chunk granularity.

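// Build the PDF in memory first, then write it to disk in one go.
using (var ms = new MemoryStream())
{
    using (var doc = new Document())
    {
        using (var writer = PdfWriter.GetInstance(doc, ms))
        {
            doc.Open();
            // Feed the cleaned-up HTML body of the e-mail to XMLWorker.
            using (var srHtml = new StringReader(message.Body.HtmlBody))
            {
                XMLWorkerHelper.GetInstance().ParseXHtml(writer, doc, srHtml);
            }
            // Closing the document flushes the remaining content to the stream.
            doc.Close();
        }
    }
    File.WriteAllBytes(_outputPath, ms.ToArray());
}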

Solution

  • The class TextChunk in iText is related to text extraction, while your code is about PDF generation. In a comment you clarified that your use case encompasses not only the PDF generation but also a later step in which the contents of those PDFs are subject to text extraction, and that you want to produce the PDFs in such a way that, in the text extraction step, each TextChunk instance in the LocationTextExtractionStrategy contains exactly one complete word.

    First of all, the chunkiness of extracted text is not merely a quirk of the PDF generator in question. There is a maximum length to a chunk: it must stop at the first character for which something changes in the current settings (e.g. the color, font, font size, ...) or for which the distance to the previous character is not determined by the width of that previous character alone.

    While the former settings only seldom change in a word (but even they occasionally do), the latter anomaly can happen pretty often if the PDF generator beautifies written text by applying kerning.

    Thus, for PDF generators with kerning support you'll usually get chunks smaller than words, and you cannot prevent this unless you deny yourself kerning support.

    Within the range allowed by these restrictions, though, how long the chunks get is usually an implementation detail of the PDF generator and usually not configurable.

    In the case at hand, iText creates chunks that are as long as possible for each consecutive piece of text it is asked to draw. You cannot change this by configuration.

    What you can do, though, is cut down the consecutive pieces of text you draw according to your requirement! E.g. for

    <html><body><p>Header material</p></body></html>
    

    you get a single chunk "Header material" but for

    <html><body><p><span>Header</span> <span>material</span></p></body></html>
    

    you get the chunks "Header", " ", and "material"!
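
    If the HTML is already being prepared with the HtmlAgilityPack, the span wrapping could be automated before the markup is handed to XMLWorker. The following is only a minimal sketch under stated assumptions: WordSpanWrapper and WrapWords are hypothetical names (not part of iText or the HtmlAgilityPack), the markup is assumed to contain a body element without script or style content, and words are simply taken to be runs of non-whitespace characters.

    ```csharp
    using System.Linq;
    using System.Text.RegularExpressions;
    using HtmlAgilityPack;

    // Hypothetical helper, not part of iText or the HtmlAgilityPack.
    public static class WordSpanWrapper
    {
        // Returns the markup with every word inside <body> wrapped in its own <span>.
        public static string WrapWords(string html)
        {
            var doc = new HtmlDocument();
            doc.LoadHtml(html);

            // Non-blank text nodes inside <body>; SelectNodes returns null when nothing matches.
            var textNodes = doc.DocumentNode.SelectNodes("//body//text()[normalize-space(.) != '']");
            if (textNodes == null)
                return html;

            // Copy to a list first, because the tree is modified inside the loop.
            foreach (var node in textNodes.ToList())
            {
                var parent = node.ParentNode;

                // The capturing group makes Regex.Split keep the whitespace runs as tokens.
                // The raw (still entity-encoded) text is split, so entities stay intact.
                foreach (var token in Regex.Split(node.InnerHtml, @"(\s+)"))
                {
                    if (token.Length == 0)
                        continue;

                    HtmlNode replacement;
                    if (string.IsNullOrWhiteSpace(token))
                    {
                        // Whitespace between words stays a plain text node.
                        replacement = doc.CreateTextNode(token);
                    }
                    else
                    {
                        // Each word becomes its own <span> element.
                        replacement = doc.CreateElement("span");
                        replacement.AppendChild(doc.CreateTextNode(token));
                    }
                    parent.InsertBefore(replacement, node);
                }

                // Drop the original, unsplit text node.
                parent.RemoveChild(node);
            }

            return doc.DocumentNode.OuterHtml;
        }
    }
    ```

    In the question's code, the StringReader would then be created from WordSpanWrapper.WrapWords(message.Body.HtmlBody) instead of the raw body. Keep in mind the restrictions described above: a change of font or color inside a word can still split it into several chunks during extraction.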