
iText 7: How to get the locations of text in documents in classical Chinese format?


What do I mean by the format of a classical Chinese document?

A paragraph is composed of lines: the first line is rightmost, the second line is to the left of the first, and so on. A line is composed of characters: the first character is topmost, the second character is below the first, and so on.

I have a file, LR-10709-24-25.pdf, in the format of a classical Chinese document, and I need the locations of its text for analysis.

  1. Applying the program (see below) to history-2-3.pdf, which is in an ordinary English-like format, gives the correct result:

(image)

  2. Applying the same program to LR-10709-24-25.pdf gives a very wrong result:

(image)

  3. I guess it has to do with coordinates, the current transformation matrix, the text matrix, and TextRenderInfo, but I need help understanding these things in the context of this problem.

Here is my program:

using iText.Kernel.Colors;
using iText.Kernel.Geom;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Data;
using iText.Kernel.Pdf.Canvas.Parser.Listener;
using System.Text;

string srcFileName = "LR-10709-24-25.pdf";
string destFileName = "LR-10709-24-25-enclose.pdf";
//string srcFileName = "history-2-3.pdf";
//string destFileName = "history-2-3-enclose.pdf";
PdfDocument pdfDoc = new PdfDocument(new PdfReader(srcFileName), new PdfWriter(destFileName));
StringBuilder sb = new StringBuilder();
for (int i = 0; i < pdfDoc.GetNumberOfPages(); i++)
{
    SimplePositionalTextEventListener listener = new SimplePositionalTextEventListener();
    new PdfCanvasProcessor(listener).ProcessPageContent(pdfDoc.GetPage(i + 1));
    List<SimpleTextWithRectangle> result = listener.GetResultantTextWithPosition();
    int R = 0, G = 0, B = 0;
    foreach (SimpleTextWithRectangle textWithRectangle in result)
    {
        R += 40; R = R % 256;
        G += 20; G = G % 256;
        B += 80; B = B % 256;
        PdfCanvas canvas = new PdfCanvas(pdfDoc.GetPage(i + 1));
        canvas.SetStrokeColor(new DeviceRgb(R, G, B));
        var rect = textWithRectangle.GetRectangle();
        canvas.Rectangle(rect);
        canvas.Stroke();
    }
}
pdfDoc.Close();

Console.WriteLine("Press any key to continue!");
Console.ReadKey();

class SimpleTextWithRectangle
{
    private Rectangle rectangle;
    private string text;

    public SimpleTextWithRectangle(Rectangle rectangle, String text)
    {
        this.rectangle = rectangle;
        this.text = text;
    }

    public Rectangle GetRectangle()
    {
        return rectangle;
    }
    public string GetText()
    {
        return text;
    }
}

class SimplePositionalTextEventListener : IEventListener
{
    private List<SimpleTextWithRectangle> textWithRectangleList = new List<SimpleTextWithRectangle>();
    private void renderText(TextRenderInfo renderInfo)
    {
        if (renderInfo.GetText().Trim().Length == 0)
            return;
        LineSegment ascent = renderInfo.GetAscentLine();
        LineSegment descent = renderInfo.GetDescentLine();
        float initX = descent.GetStartPoint().Get(0);
        float initY = descent.GetStartPoint().Get(1);
        float endX = ascent.GetEndPoint().Get(0);
        float endY = ascent.GetEndPoint().Get(1);

        Rectangle rectangle = new Rectangle(initX, initY, endX - initX, endY - initY);

        SimpleTextWithRectangle textWithRectangle = new SimpleTextWithRectangle(rectangle, renderInfo.GetText());
        textWithRectangleList.Add(textWithRectangle);
    }

    public List<SimpleTextWithRectangle> GetResultantTextWithPosition()
    {
        return textWithRectangleList;
    }
    public void EventOccurred(IEventData data, EventType type)
    {
        renderText((TextRenderInfo)data);
    }

    public ICollection<EventType> GetSupportedEvents()
    {
        return new List<EventType> { EventType.RENDER_TEXT };
    }
}

The two PDF files are history-2-3.pdf and LR-10709-24-25.pdf.

Update

I still believe TextRenderInfo contains the information necessary to locate the text.

I updated my program (below).

  1. The statement Utils.Enclose("LR-10709-24-25.pdf", "LR-10709-24-25-enclose.pdf"); creates LR-10709-24-25-enclose.pdf with the same incorrect rectangle locations as before.

  2. The statement Utils.DoubleWrite("LR-10709-24-25.pdf", "LR-10709-24-25-dup.pdf"); creates LR-10709-24-25-dup.pdf, which uses the PdfCanvas.ShowText method to write the text (in red) over the original content, positioned with the text matrix from TextRenderInfo. The top portion of page 1 of the result is:

(image)

     which is not correct. The result for page 2 is very interesting:

(image)

     Look at the image carefully: each red character sits slightly below the same character in black. The varying distances may be caused by character spacing or word spacing. The fifth vertical line (counting from the right) shows this most clearly.

  3. The statement Utils.AppendPage("LR-10709-24-25.pdf", "LR-10709-24-25-append.pdf"); creates LR-10709-24-25-append.pdf, which appends two pages, again using the PdfCanvas.ShowText method to write the text (in red) positioned with the text matrix from TextRenderInfo. Pages 1 and 2 are the same as pages 1 and 2 of the original PDF. The new page 3 (built from the TextRenderInfo of original page 1) looks like this:

(image)

and the new page 4 looks like this:

(image)

Very interesting. It looks as if the two original pages were first combined side by side, page 2 on the left and page 1 on the right. The new page 3 is that combined image rotated by 90 degrees. The new page 4 is the combined image cropped to the shorter width of the page.

So it seems that TextRenderInfo contains all the information necessary for the text locations. But as an iText beginner, I lack the knowledge about coordinates, the text matrix, the CTM, PdfFont, CropBox, and so on.

I hope someone can help me!

My updated program is:

using iText.Kernel.Colors;
using iText.Kernel.Font;
using iText.Kernel.Geom;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Data;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

Utils.Enclose("LR-10709-24-25.pdf", "LR-10709-24-25-enclose.pdf");
Utils.DoubleWrite("LR-10709-24-25.pdf", "LR-10709-24-25-dup.pdf");
Utils.AppendPage("LR-10709-24-25.pdf", "LR-10709-24-25-append.pdf");

Console.WriteLine("Press any key to continue 123!");
Console.ReadKey();

class Utils
{
    public static void Enclose(string srcFileName, string destFileName)
    {
        //string srcFileName = "LR-10709-24-25.pdf";
        //string destFileName = "LR-10709-24-25-enclose.pdf";
        //string srcFileName = "history-2-3.pdf";
        //string destFileName = "history-2-3-enclose.pdf";
        PdfDocument pdfDoc = new PdfDocument(new PdfReader(srcFileName), new PdfWriter(destFileName));
        for (int i = 0; i < pdfDoc.GetNumberOfPages(); i++)
        {
            SimpleTextEventListener listener = new SimpleTextEventListener();
            new PdfCanvasProcessor(listener).ProcessPageContent(pdfDoc.GetPage(i + 1));
            List<SimpleTextInfo> textInfos = listener.GetTextInfos();
            PdfCanvas canvas = new PdfCanvas(pdfDoc.GetPage(i + 1));
            int R = 0, G = 0, B = 0;
            foreach (SimpleTextInfo textInfo in textInfos)
            {
                R += 40; R = R % 256;
                G += 20; G = G % 256;
                B += 80; B = B % 256;
                canvas.SetStrokeColor(new DeviceRgb(R, G, B));
                var rect = textInfo.GetRectangle();
                canvas.Rectangle(rect);
                canvas.Stroke();
            }
        }
        pdfDoc.Close();
    }
    public static void DoubleWrite(string srcFileName, string destFileName)
    {
        PdfDocument pdfDoc = new PdfDocument(new PdfReader(srcFileName), new PdfWriter(destFileName));
        for (int i = 0; i < pdfDoc.GetNumberOfPages(); i++)
        {
            SimpleTextEventListener listener = new SimpleTextEventListener();
            new PdfCanvasProcessor(listener).ProcessPageContent(pdfDoc.GetPage(i + 1));
            List<SimpleTextInfo> textInfos = listener.GetTextInfos();
            PdfCanvas canvas = new PdfCanvas(pdfDoc.GetPage(i + 1));
            foreach (SimpleTextInfo textInfo in textInfos)
            {
                canvas.SaveState();
                canvas.SetColor(ColorConstants.RED, true);
                canvas.SetFontAndSize(textInfo.GetFont(), textInfo.GetFontSize());
                canvas.BeginText();
                Matrix textMatrix = textInfo.GetTextMatrix();
                canvas.SetTextMatrix(textMatrix.Get(0), textMatrix.Get(1), textMatrix.Get(3), textMatrix.Get(4), textMatrix.Get(6), textMatrix.Get(7))
                            .ShowText(textInfo.GetText());
                canvas.EndText();
                canvas.RestoreState();
            }
        }
        pdfDoc.Close();
    }
    public static void AppendPage(string srcFileName, string destFileName)
    {
        PdfDocument pdfDoc = new PdfDocument(new PdfReader(srcFileName), new PdfWriter(destFileName));
        int n = pdfDoc.GetNumberOfPages();
        for (int i = 0; i < n; i++)
        {
            SimpleTextEventListener listener = new SimpleTextEventListener();
            var page = pdfDoc.GetPage(i + 1);
            new PdfCanvasProcessor(listener).ProcessPageContent(page);
            List<SimpleTextInfo> textInfos = listener.GetTextInfos();
            var newPage = pdfDoc.AddNewPage();
            PdfCanvas canvas = new PdfCanvas(newPage);
            foreach (SimpleTextInfo textInfo in textInfos)
            {
                canvas.SaveState();
                canvas.SetColor(ColorConstants.RED, true);
                canvas.SetFontAndSize(textInfo.GetFont(), textInfo.GetFontSize());
                canvas.BeginText();
                Matrix textMatrix = textInfo.GetTextMatrix();
                canvas.SetTextMatrix(textMatrix.Get(0), textMatrix.Get(1), textMatrix.Get(3), textMatrix.Get(4), textMatrix.Get(6), textMatrix.Get(7))
                            .ShowText(textInfo.GetText());
                canvas.EndText();
                canvas.RestoreState();
            }
        }
        pdfDoc.Close();
    }
}

class SimpleTextInfo
{
    private Rectangle rectangle;
    private string text;
    private PdfFont font;
    private float fontSize;
    private Matrix textMatrix;
    private Matrix canvasMatrix;
    public SimpleTextInfo(Rectangle rectangle, String text, PdfFont font, float fontSize, Matrix textMatrix, Matrix canvasMatrix)
    {
        this.rectangle = rectangle;
        this.text = text;
        this.font = font;
        this.fontSize = fontSize;
        this.textMatrix = textMatrix;
        this.canvasMatrix = canvasMatrix;
    }
    public Rectangle GetRectangle()
    {
        return rectangle;
    }
    public string GetText()
    {
        return text;
    }
    public PdfFont GetFont()
    {
        return font;
    }
    public float GetFontSize()
    {
        return fontSize;
    }
    public Matrix GetTextMatrix()
    {
        return textMatrix;
    }
    public Matrix GetCanvasMatrix()
    {
        return canvasMatrix;
    }
}

class SimpleTextEventListener : IEventListener
{
    private List<SimpleTextInfo> simpleTextInfos = new List<SimpleTextInfo>();
    private void renderText(TextRenderInfo renderInfo)
    {
        if (renderInfo.GetText().Trim().Length == 0)
            return;
        LineSegment ascent = renderInfo.GetAscentLine();
        LineSegment descent = renderInfo.GetDescentLine();
        float initX = descent.GetStartPoint().Get(0);
        float initY = descent.GetStartPoint().Get(1);
        float endX = ascent.GetEndPoint().Get(0);
        float endY = ascent.GetEndPoint().Get(1);
        Rectangle rectangle = new Rectangle(initX, initY, endX - initX, endY - initY);
        SimpleTextInfo textInfo = new SimpleTextInfo(rectangle, renderInfo.GetText(), renderInfo.GetFont(), renderInfo.GetFontSize(), renderInfo.GetTextMatrix(), renderInfo.GetGraphicsState().GetCtm());
        simpleTextInfos.Add(textInfo);
    }
    public List<SimpleTextInfo> GetTextInfos()
    {
        return simpleTextInfos;
    }
    public void EventOccurred(IEventData data, EventType type)
    {
        renderText((TextRenderInfo)data);
    }
    public ICollection<EventType> GetSupportedEvents()
    {
        return new List<EventType> { EventType.RENDER_TEXT };
    }
}

The two resulting PDF files are LR-10709-24-25-dup.pdf and LR-10709-24-25-append.pdf.


Solution

  • iText text parsing currently only properly supports horizontal writing mode, not vertical writing mode. It handles all text as if it were written in horizontal mode.

    In your second PDF most fonts are configured to use vertical writing mode; only a few use horizontal mode.

    You can easily recognize the characters written in horizontal mode: they are properly enclosed in their respective bounding boxes.

    The (incorrect) bounding box of a text chunk drawn in vertical mode, on the other hand, starts atop its first character at the middle of its width and extends to the right by as many character widths as the chunk has characters running downwards.

    To extend iText to also support vertical writing mode, you have to fix the PdfCanvasProcessor (in particular its displayPdfString method that always advances the textMatrix horizontally) and the TextRenderInfo (that calculates dimensions always as if the text was written horizontally). (At least that's what I see at first glance.)
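
    Until such a fix exists, the description of the incorrect box above suggests a rough workaround: take the start point of the box iText reports (the top middle of the first character) and rebuild the box downwards instead of to the right. The following is only a minimal sketch of that arithmetic, not iText code. The function name ApproximateVerticalBox is made up for illustration, and it assumes full-width glyphs, the default vertical metrics from the PDF specification (/DW2 [880 -1000], i.e. each glyph advances one em downwards), no character or word spacing, and an unrotated, unscaled coordinate system.

```csharp
using System;

// Hypothetical helper, not part of iText.
// (x, y) is the text position of the first glyph in page coordinates, for
// example the start point of the baseline reported by TextRenderInfo.
// In vertical writing mode that point is the top middle of the first glyph.
// Assumptions: full-width glyphs, default /DW2 [880 -1000], no extra
// character or word spacing, no rotation or scaling.
static (float X, float Y, float Width, float Height) ApproximateVerticalBox(
    float x, float y, float fontSize, int glyphCount)
{
    float width = fontSize;               // a full-width glyph is one em wide
    float height = glyphCount * fontSize; // each glyph advances one em downwards
    float left = x - width / 2f;          // the origin is at the horizontal middle
    float bottom = y - height;            // the origin is at the top of the first glyph
    return (left, bottom, width, height);
}

// Example: a chunk of 10 glyphs at 12 pt whose first glyph starts at (500, 700).
var box = ApproximateVerticalBox(500f, 700f, 12f, 10);
Console.WriteLine($"x={box.X}, y={box.Y}, w={box.Width}, h={box.Height}");
```

    The four values could be passed to new Rectangle(x, y, width, height) in the question's listener in place of the ascent/descent based rectangle, but only for chunks whose font really uses vertical writing mode. For fonts with custom vertical metrics (/W2) or with non-zero character spacing the result will be off, which is consistent with the varying offsets seen in the question's page 2 experiment.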