
iText: difference between Word-generated PDF and CorelDraw-generated PDF


I'm trying to get the location of a specific string in PDFs generated from different sources (Word, Excel, CorelDraw, ...). I have extracted the part of the code that I think is relevant to this question:

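    // Namespaces the snippet relies on (iText 7 for .NET, WPF)
    using System.Collections.Generic;
    using System.Windows;
    using iText.Kernel.Pdf;
    using iText.Kernel.Pdf.Canvas.Parser;
    using iText.Kernel.Pdf.Canvas.Parser.Data;
    using iText.Kernel.Pdf.Canvas.Parser.Listener;
    using iText.Layout;

    private static iText.Kernel.Geom.Rectangle? textRect;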
    private static readonly string specificText = "Test";
    private void BtnNewPDF_Click(object sender, RoutedEventArgs e)
    {

        //Initialize PDF document
        PdfDocument pdfDoc = new(new PdfReader(@"D:\Word-test.pdf"));

        using Document document = new(pdfDoc);

        // Get the specified page from the PDF document
        PdfPage pdfPage = pdfDoc.GetPage(1);

        // Create a PdfCanvasProcessor object
        PdfCanvasProcessor canvasProcessor = new(new MyEventListener());

        // Process the page
        canvasProcessor.ProcessPageContent(pdfPage);

        if (textRect is not null)
        {
            MessageBox.Show(textRect.GetX().ToString());
            MessageBox.Show(textRect.GetY().ToString());
        }

        document.Close();
    }

    class MyEventListener : IEventListener
    {
        public void EventOccurred(IEventData data, EventType type)
        {
            if (type == EventType.RENDER_TEXT)
            {
                // Cast the IEventData to TextRenderInfo
                TextRenderInfo renderInfo = (TextRenderInfo)data;

                //See if the current chunk contains the text
                var startPosition = System.Globalization.CultureInfo.CurrentCulture.CompareInfo.IndexOf(renderInfo.GetText(), specificText);
                //If not found bail
                if (startPosition < 0) { return; }

                // Get the bounding box for the text
                textRect = renderInfo.GetDescentLine().GetBoundingRectangle();

            }
        }

        public ICollection<EventType> GetSupportedEvents()
        {
            return new List<EventType> { EventType.RENDER_TEXT };
        }
    }

When I'm working with a PDF generated from CorelDraw, I'm able to get the whole string ("Test") in the event data. But when I'm using a PDF generated from Microsoft Word, I'm getting chunks like "T" and "est".

Am I using the wrong procedure to get the location of the string, or can I change some setting in Microsoft Word when generating the PDF? If neither of these is possible, I'll need to write some code to concatenate the letters to get the searched string and its location. I don't know how to attack this problem. Can somebody tell me, in general, how this can be solved?


Solution

  • PDF content is drawn according to the drawing instructions in several kinds of content streams therein. These instructions may draw text in different manners: they may draw a whole text line in one instruction, they may draw it using separate instructions for each word, and they may even draw it using separate instructions for each letter.

    iText forwards each string argument of a text drawing instruction to you in a separate event. Thus, your observation essentially means that CorelDraw draws larger pieces of a line at once than Word does. Most likely Word draws "T" and "est" in separate instructions to apply kerning between them.

    So essentially you indeed "need to make some code to concatenate the letters to get searched string and get location."
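The concatenation mentioned above can be sketched without any iText dependency. This is only an illustration under simplifying assumptions: `Chunk` and `ChunkSearch` are hypothetical names (not iText types), the chunks are assumed to arrive in reading order on a single line, and no spaces need to be inserted between them. In a real listener each `Chunk` would be filled from the text and rectangle of one RENDER_TEXT event.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for the text and rectangle reported by one RENDER_TEXT event.
public record Chunk(string Text, float X, float Y, float Width, float Height);

public static class ChunkSearch
{
    // Returns a rectangle spanning every chunk that contributes to the first match, or null.
    public static Chunk? Find(IReadOnlyList<Chunk> chunks, string needle)
    {
        if (string.IsNullOrEmpty(needle)) return null;

        var joined = new StringBuilder();
        var starts = new List<int>();   // offset of each chunk within the joined text
        foreach (Chunk c in chunks)
        {
            starts.Add(joined.Length);
            joined.Append(c.Text);
        }

        int hit = joined.ToString().IndexOf(needle, StringComparison.Ordinal);
        if (hit < 0) return null;
        int end = hit + needle.Length;

        float left = float.MaxValue, bottom = float.MaxValue;
        float right = float.MinValue, top = float.MinValue;
        for (int i = 0; i < chunks.Count; i++)
        {
            Chunk c = chunks[i];
            int cStart = starts[i], cEnd = cStart + c.Text.Length;
            if (cEnd <= hit || cStart >= end) continue;   // chunk lies outside the match
            left = Math.Min(left, c.X);
            bottom = Math.Min(bottom, c.Y);
            right = Math.Max(right, c.X + c.Width);
            top = Math.Max(top, c.Y + c.Height);
        }
        return new Chunk(needle, left, bottom, right - left, top - bottom);
    }
}
```

Note that the result covers whole chunks, so it can be wider than the searched string when a chunk also contains neighbouring text. Content streams are also not guaranteed to list chunks in reading order, so a robust version would have to sort them by position first.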

    In that context you wonder "how to attack this problem. Can somebody tell me, in general, how can this be solved?"

    iText includes example event listeners for text extraction, e.g. the SimpleTextExtractionStrategy and the LocationTextExtractionStrategy. As you are trying to not only extract the text but also locations, the newer RegexBasedLocationExtractionStrategy might be of special interest to you.

    If you cannot use the RegexBasedLocationExtractionStrategy as is, you can at least look at its code for inspiration. iText is open source after all...
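As a rough sketch of how the RegexBasedLocationExtractionStrategy could replace the custom listener from the question, assuming iText 7 for .NET: the class name `TextLocator` is hypothetical, and the file path and search text are taken from the question. The strategy joins the text chunks itself before matching, so whether a given match is found still depends on how it reassembles the text of the particular PDF.

```csharp
using System;
using System.Text.RegularExpressions;
using iText.Kernel.Geom;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

// Hypothetical console wrapper; only the iText calls matter here.
static class TextLocator
{
    static void Main(string[] args)
    {
        // Path from the question unless another one is passed on the command line.
        string path = args.Length > 0 ? args[0] : @"D:\Word-test.pdf";
        PdfDocument pdfDoc = new(new PdfReader(path));
        try
        {
            // Escape the search text so it is matched literally, not as a regex pattern.
            var strategy = new RegexBasedLocationExtractionStrategy(Regex.Escape("Test"));

            // The strategy is itself an event listener, so it plugs into the processor directly.
            new PdfCanvasProcessor(strategy).ProcessPageContent(pdfDoc.GetPage(1));

            foreach (IPdfTextLocation location in strategy.GetResultantLocations())
            {
                Rectangle rect = location.GetRectangle();
                Console.WriteLine($"\"{location.GetText()}\" at x={rect.GetX()}, y={rect.GetY()}");
            }
        }
        finally
        {
            pdfDoc.Close();
        }
    }
}
```

Unlike the static `textRect` field in the question, this reports every match on the page rather than only the last one seen.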