Tags: vb.net, pdf, graphics, itext, markup

iTextSharp get reference to a graphic markup


I have been researching how to do this for a couple of hours but have hit a brick wall. I have a PDF file in which one of the objects is a north arrow. It is a simple line graphic (I believe these are called Graphic Markups in Acrobat) that denotes which way is "up". I want to read that line graphic and determine its rotation. The first step I took was to see whether I could enumerate the contents of the PDF with this code:

Imports it = iTextSharp.text
Imports ip = iTextSharp.text.pdf

Dim pdfRdr As New ip.PdfReader("C:\city.pdf")
Dim page As ip.PdfDictionary = pdfRdr.GetPageN(1)
' Note: this cast assumes /Contents is a single indirect reference; it can also be an array of streams.
Dim objectReference As ip.PdfIndirectReference = CType(page.Get(ip.PdfName.CONTENTS), ip.PdfIndirectReference)
Dim stream As ip.PRStream = CType(ip.PdfReader.GetPdfObject(objectReference), ip.PRStream)
' Decode the content stream and tokenize the raw operators and operands
Dim streamBytes() As Byte = ip.PdfReader.GetStreamBytes(stream)
Dim tokenizer As New ip.PRTokeniser(New ip.RandomAccessFileOrArray(streamBytes))

' Loop through each PDF token
While tokenizer.NextToken()
    Debug.Print("token of type={0} and value={1}", tokenizer.TokenType.ToString(), tokenizer.StringValue)
End While

I do get some data back, but I'm afraid I don't understand how to decipher it:

token of type=OTHER and value=q
token of type=NUMBER and value=0.86275
token of type=NUMBER and value=0
token of type=NUMBER and value=0
token of type=NUMBER and value=0.86275
token of type=NUMBER and value=54
token of type=NUMBER and value=30
token of type=OTHER and value=cm
token of type=NAME and value=Fm0
token of type=OTHER and value=Do
token of type=OTHER and value=Q
token of type=OTHER and value=q
token of type=NUMBER and value=1
token of type=NUMBER and value=0
token of type=NUMBER and value=0
token of type=NUMBER and value=1
token of type=NUMBER and value=54
token of type=NUMBER and value=18
token of type=OTHER and value=cm
token of type=NAME and value=Fm1
token of type=OTHER and value=Do
token of type=OTHER and value=Q

I have stripped the PDF down to show only the graphic that I am interested in. [Two screenshots omitted.]

The test file is here: https://drive.google.com/file/d/1dYFkvLMvznsx6sN-1GsNZVIBtDpgzwCU/view?usp=sharing

Am I going down the right path or is there a different way to get a reference to a graphic markup?


Solution

  • Contrary to the initial impression, the north arrow is not in an annotation of the PDF but is part of the regular page content. (@Jon wrote his answer under that initial impression.)

    In the PDF shared by the OP, the arrow is part of the immediate page content. In the Adobe Acrobat screenshot shared by the OP, on the other hand, the arrow appears to be in a form XObject (which in turn would be referenced from the immediate page content).
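
    The token dump in the question has the shape of such a reference: each q ... cm /FmN Do Q group saves the graphics state, concatenates the matrix a b c d e f to the current transformation matrix, paints the named XObject, and restores the state. As a rough sketch (the module and function names are made up for illustration and are not part of iTextSharp), the rotation and scale encoded in the cm operands can be read out like this, assuming the matrix is a plain rotation plus scale without skew:

    ```vbnet
    Imports System

    Module CmMatrixSketch
        ' Hypothetical helper: rotation in degrees encoded in the operands "a b c d e f cm".
        ' Assumes no skew, so (a, b) is the image of the x unit vector.
        Function RotationDegrees(a As Double, b As Double) As Double
            Return Math.Atan2(b, a) * 180.0 / Math.PI
        End Function

        ' Hypothetical helper: scale factor along the x axis.
        Function ScaleX(a As Double, b As Double) As Double
            Return Math.Sqrt(a * a + b * b)
        End Function

        Sub Main()
            ' Operands of the first cm in the question: 0.86275 0 0 0.86275 54 30
            Console.WriteLine(RotationDegrees(0.86275, 0))
            Console.WriteLine(ScaleX(0.86275, 0))
        End Sub
    End Module
    ```

    For that first group this gives a rotation of 0 degrees and a scale of about 0.86, with the translation (54, 30) in the last two operands. Both cm matrices in the dump carry no rotation, so the direction of the arrow has to come from the path coordinates themselves.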

    The following approach, based on the iText parser framework, should retrieve the vector graphics instructions drawing the arrow in either case.

    With a current iText 5.5.x, for example, you implement IExtRenderListener and pass that implementation to a PdfReaderContentParser, e.g.:

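    ' Imports assumed by this listing (iTextSharp 5.5.x)
    Imports System.Collections.Generic
    Imports System.Text
    Imports iTextSharp.text
    Imports iTextSharp.text.pdf
    Imports iTextSharp.text.pdf.parser

    Public Class VectorParser
        Implements IExtRenderListener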
    
        Public Sub ModifyPath(renderInfo As PathConstructionRenderInfo) Implements IExtRenderListener.ModifyPath
            pathInfos.Add(renderInfo)
        End Sub
    
        Public Function RenderPath(renderInfo As PathPaintingRenderInfo) As parser.Path Implements IExtRenderListener.RenderPath
            ' The graphics state is read via reflection (see getGraphicsState below)
            Dim gs As GraphicsState = getGraphicsState(renderInfo)
            Dim ctm As Matrix = gs.GetCtm()
    
            ' The operation is a bit mask that may combine FILL and STROKE
            Dim fills As Boolean = (renderInfo.Operation And PathPaintingRenderInfo.FILL) <> 0
            Dim strokes As Boolean = (renderInfo.Operation And PathPaintingRenderInfo.STROKE) <> 0
    
            If fills Then
                Console.Write("FILL ({0}) ", ToString(gs.FillColor))
                If strokes Then
                    Console.Write("and ")
                End If
            End If
    
            If strokes Then
                Console.Write("STROKE ({0}) ", ToString(gs.StrokeColor))
            End If
    
            Console.Write("the path ")
    
            For Each pathConstructionRenderInfo In pathInfos
                Select Case pathConstructionRenderInfo.Operation
                    Case PathConstructionRenderInfo.MOVETO
                        Console.Write("move to {0} ", ToString(transform(ctm, pathConstructionRenderInfo.SegmentData)))
                    Case PathConstructionRenderInfo.CLOSE
                        Console.Write("close {0} ", ToString(transform(ctm, pathConstructionRenderInfo.SegmentData)))
                    Case PathConstructionRenderInfo.CURVE_123
                        Console.Write("curve123 {0} ", ToString(transform(ctm, pathConstructionRenderInfo.SegmentData)))
                    Case PathConstructionRenderInfo.CURVE_13
                        Console.Write("curve13 {0} ", ToString(transform(ctm, pathConstructionRenderInfo.SegmentData)))
                    Case PathConstructionRenderInfo.CURVE_23
                        Console.Write("curve23 {0} ", ToString(transform(ctm, pathConstructionRenderInfo.SegmentData)))
                    Case PathConstructionRenderInfo.LINETO
                        Console.Write("line to {0} ", ToString(transform(ctm, pathConstructionRenderInfo.SegmentData)))
                    Case PathConstructionRenderInfo.RECT
                        Console.Write("rectangle {0} ", ToString(transform(ctm, expandRectangleCoordinates(pathConstructionRenderInfo.SegmentData))))
                End Select
            Next
    
            Console.WriteLine()
    
            pathInfos.Clear()
            Return Nothing
        End Function
    
        Public Sub ClipPath(rule As Integer) Implements IExtRenderListener.ClipPath
        End Sub
    
        Public Sub BeginTextBlock() Implements IRenderListener.BeginTextBlock
        End Sub
    
        Public Sub RenderText(renderInfo As TextRenderInfo) Implements IRenderListener.RenderText
        End Sub
    
        Public Sub EndTextBlock() Implements IRenderListener.EndTextBlock
        End Sub
    
        Public Sub RenderImage(renderInfo As ImageRenderInfo) Implements IRenderListener.RenderImage
        End Sub
    
        Function expandRectangleCoordinates(rectangle As IList(Of Single)) As List(Of Single)
            If rectangle.Count < 4 Then
                Return New List(Of Single)
            End If
    
            Return New List(Of Single)() From
            {
                rectangle(0), rectangle(1),
                rectangle(0) + rectangle(2), rectangle(1),
                rectangle(0) + rectangle(2), rectangle(1) + rectangle(3),
                rectangle(0), rectangle(1) + rectangle(3)
            }
        End Function
    
        Function transform(ctm As Matrix, coordinates As IList(Of Single)) As List(Of Single)
            Dim result As List(Of Single) = New List(Of Single)
            If Not coordinates Is Nothing Then
                For i = 0 To coordinates.Count - 1 Step 2
                    Dim vector As Vector = New Vector(coordinates(i), coordinates(i + 1), 1)
                    vector = vector.Cross(ctm)
                    result.Add(vector(Vector.I1))
                    result.Add(vector(Vector.I2))
                Next
            End If
            Return result
        End Function
    
        Public Function ToString(coordinates As IList(Of Single)) As String
            Dim result As StringBuilder = New StringBuilder()
            result.Append("[ ")
            For i = 0 To coordinates.Count - 1
                result.Append(coordinates(i))
                result.Append(" ")
            Next
            result.Append("]")
            Return result.ToString()
        End Function
    
        Public Function ToString(baseColor As BaseColor) As String
            If (baseColor Is Nothing) Then
                Return "DEFAULT"
            End If
            Return String.Format("{0},{1},{2}", baseColor.R, baseColor.G, baseColor.B)
        End Function
    
        Function getGraphicsState(renderInfo As PathPaintingRenderInfo) As GraphicsState
            Dim gsField As Reflection.FieldInfo = GetType(PathPaintingRenderInfo).GetField("gs", Reflection.BindingFlags.NonPublic Or Reflection.BindingFlags.Instance)
            Return CType(gsField.GetValue(renderInfo), GraphicsState)
        End Function
    
        Dim pathInfos As List(Of PathConstructionRenderInfo) = New List(Of PathConstructionRenderInfo)
    End Class
    

    which is used like this

    Using pdfReader As New PdfReader("test.pdf")
        Dim extRenderListener As IExtRenderListener = New VectorParser
    
        For page = 1 To pdfReader.NumberOfPages
            Console.Write(vbCrLf + "Page {0}" + vbCrLf + "====" + vbCrLf, page)
            Dim parser As PdfReaderContentParser = New PdfReaderContentParser(pdfReader)
            parser.ProcessContent(page, extRenderListener)
        Next
    End Using
    

    For your shared document, this prints

    Page 1
    ====
    STROKE (0,0,255) the path move to [ 277,359 434,2797 ] line to [ 311,5242 434,2797 ] 
    STROKE (0,0,255) the path move to [ 277,3591 434,2797 ] line to [ 315,0443 424,1336 ] 
    STROKE (0,0,255) the path move to [ 304,2772 425,376 ] line to [ 304,4842 426,6183 ] 
    STROKE (0,0,255) the path move to [ 304,6913 426,2042 ] line to [ 310,075 425,376 ] 
    STROKE (0,0,255) the path move to [ 304,6913 426,8254 ] line to [ 307,5902 425,9972 ] 
    FILL (0,0,255) the path move to [ 303,656 425,3759 ] line to [ 303,656 425,3759 ] line to [ 306,1407 425,1689 ] line to [ 306,1407 425,1689 ] 
    STROKE (0,0,255) the path move to [ 303,656 425,376 ] line to [ 303,656 425,376 ] line to [ 306,1407 425,1689 ] line to [ 306,1407 425,1689 ] close [ ] 
    FILL (0,0,255) the path move to [ 306,969 424,9618 ] line to [ 306,969 424,9618 ] line to [ 309,4538 424,7548 ] line to [ 309,4538 424,7548 ] 
    STROKE (0,0,255) the path move to [ 306,969 424,9619 ] line to [ 306,969 424,9619 ] line to [ 309,4538 424,7548 ] line to [ 309,4538 424,7548 ] close [ ] 
    FILL (0,0,255) the path move to [ 309,8679 424,9618 ] line to [ 309,8679 424,9618 ] line to [ 312,3527 424,5477 ] line to [ 312,3527 424,5477 ] 
    STROKE (0,0,255) the path move to [ 309,868 424,9619 ] line to [ 309,868 424,9619 ] line to [ 312,3527 424,5477 ] line to [ 312,3527 424,5477 ] close [ ] 
    STROKE (0,0,255) the path move to [ 313,1809 424,3407 ] line to [ 314,8374 424,1336 ] 
    STROKE (0,0,255) the path move to [ 304,2772 425,7901 ] line to [ 309,8679 424,9619 ] line to [ 312,9738 424,7548 ] 
    STROKE (0,0,255) the path move to [ 304,2772 425,9972 ] line to [ 309,8679 425,1689 ] line to [ 311,5244 424,9619 ] 
    STROKE (0,0,255) the path move to [ 304,6914 426,8254 ] line to [ 315,0445 424,1336 ] 
    STROKE (0,0,255) the path move to [ 311,7315 435,7292 ] line to [ 311,7315 432,8303 ] 
    STROKE (0,0,255) the path move to [ 321,2564 434,2797 ] line to [ 315,4587 434,2797 ] 
    STROKE (0,0,255) the path move to [ 315,4586 434,2797 ] line to [ 311,7315 434,2797 ] 
    STROKE (0,0,255) the path move to [ 311,7315 434,6938 ] line to [ 317,7363 434,0727 ] line to [ 311,7315 433,6585 ] 
    STROKE (0,0,255) the path move to [ 311,7315 434,4868 ] line to [ 314,8374 434,2797 ] line to [ 311,7315 434,2797 ] 
    STROKE (0,0,255) the path move to [ 310,6963 436,1433 ] line to [ 317,3222 434,9009 ] line to [ 322,2917 434,2797 ] line to [ 317,3222 433,6585 ] line to [ 310,6963 432,6232 ] 
    STROKE (0,0,255) the path move to [ 311,7315 435,5221 ] line to [ 317,3222 434,6938 ] line to [ 321,0493 434,2797 ] line to [ 317,3222 433,8656 ] line to [ 311,7315 433,0374 ] 
    STROKE (0,0,255) the path move to [ 311,7315 435,108 ] line to [ 317,3222 434,4868 ] line to [ 319,3928 434,2797 ] line to [ 317,3222 434,2797 ] line to [ 311,7315 433,4515 ]
    

    This looks like a lot of instructions for a simple arrow, but zooming into the PDF one sees that the arrow is indeed constructed of numerous small lines. [Screenshot omitted.]

    In particular the arrow heads look like someone created them by hand using line segments of different lengths and widths.
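
    Since the goal in the question is the rotation of the arrow, the printed coordinates can be turned into an angle with Math.Atan2. The following is only a sketch: the helper name is hypothetical, the endpoints are copied from the first two output lines above (with the decimal commas of the output written as decimal points), and any /Rotate entry of the page is ignored. Which segment counts as the shaft of the arrow, and which end is its tip, is something you have to decide for your drawings.

    ```vbnet
    Imports System

    Module ArrowAngleSketch
        ' Hypothetical helper: direction of the segment (x1, y1) -> (x2, y2) in degrees,
        ' counterclockwise from the positive x axis of the default PDF user space (y points up).
        Function AngleDegrees(x1 As Double, y1 As Double, x2 As Double, y2 As Double) As Double
            Return Math.Atan2(y2 - y1, x2 - x1) * 180.0 / Math.PI
        End Function

        Sub Main()
            ' Endpoints of the first two stroked paths in the output above
            Console.WriteLine(AngleDegrees(277.359, 434.2797, 311.5242, 434.2797))
            Console.WriteLine(AngleDegrees(277.3591, 434.2797, 315.0443, 424.1336))
        End Sub
    End Module
    ```

    The first segment is horizontal (0 degrees); the second comes out at roughly -15 degrees, i.e. tilted clockwise from the x axis.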


    The code above is essentially a port of the anonymous ExtRenderListener implementation for Java and iText 5.5.x in this answer.

    It is equally simple to implement this using iText 7.


    As an aside: unfortunately, the instructions drawing the arrow are not specifically marked. If there are other vector graphics on the same page, you will have to filter the results returned by the parser by some criterion, e.g. the color (pure RGB blue in the case at hand) or an approximate x and y coordinate range.
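
    Such a filter could look roughly like the sketch below. It is independent of iText: PaintedPath is a hypothetical record you would fill from RenderPath (color plus transformed coordinates), and the bounding box used in Main is merely chosen to enclose the coordinates printed above.

    ```vbnet
    Imports System
    Imports System.Collections.Generic

    Module PathFilterSketch
        ' Hypothetical record of one painted path: color plus transformed coordinates (x0, y0, x1, y1, ...).
        Class PaintedPath
            Public R As Integer
            Public G As Integer
            Public B As Integer
            Public Coordinates As New List(Of Single)
        End Class

        ' True if the path is pure RGB blue and all of its points lie inside the given rectangle.
        Function IsArrowCandidate(path As PaintedPath, minX As Single, minY As Single, maxX As Single, maxY As Single) As Boolean
            If path.R <> 0 OrElse path.G <> 0 OrElse path.B <> 255 Then
                Return False
            End If
            For i As Integer = 0 To path.Coordinates.Count - 2 Step 2
                Dim x As Single = path.Coordinates(i)
                Dim y As Single = path.Coordinates(i + 1)
                If x < minX OrElse x > maxX OrElse y < minY OrElse y > maxY Then
                    Return False
                End If
            Next
            Return True
        End Function

        Sub Main()
            ' First stroked path of the output above; the box 270..330 x 420..440 is an assumed search area
            Dim blue As New PaintedPath With {.R = 0, .G = 0, .B = 255}
            blue.Coordinates.AddRange(New Single() {277.359F, 434.2797F, 311.5242F, 434.2797F})
            Console.WriteLine(IsArrowCandidate(blue, 270, 420, 330, 440))
        End Sub
    End Module
    ```

    Paths in another color or with any point outside the box are rejected, so only the remaining candidates need to be inspected for the arrow's direction.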