I wrote the following function to read the text out of a PDF file. It is pretty close, but I'm just not familiar enough with all the op codes to get the line spacing right. For example, I'm currently inserting a new line when I see "ET" but that doesn't seem quite right since it may just be the end of a text run, mid line. Could someone help me adjust the parsing? My goal is something similar to Adobe Reader's "Save as other" --> "Text"
Public Function ReadPDFFile(filePath As String,
Optional maxLength As Integer = 0) As String
Dim sbContents As New StringBuilder
Dim cArrayType As Type = GetType(CArray)
Dim cCommentType As Type = GetType(CComment)
Dim cIntegerType As Type = GetType(CInteger)
Dim cNameType As Type = GetType(CName)
Dim cNumberType As Type = GetType(CNumber)
Dim cOperatorType As Type = GetType(COperator)
Dim cRealType As Type = GetType(CReal)
Dim cSequenceType As Type = GetType(CSequence)
Dim cStringType As Type = GetType(CString)
Dim opCodeNameType As Type = GetType(OpCodeName)
Dim ReadObject As Action(Of CObject) = Sub(obj As CObject)
Dim objType As Type = obj.GetType
Select Case objType
Case cArrayType
Dim arrObj As CArray = DirectCast(obj, CArray)
For Each member As CObject In arrObj
ReadObject(member)
Next
Case cOperatorType
Dim opObj As COperator = DirectCast(obj, COperator)
Select Case System.Enum.GetName(opCodeNameType, opObj.OpCode.OpCodeName)
Case "ET", "Tx"
sbContents.Append(vbNewLine)
Case "Tj", "TJ"
For Each operand As CObject In opObj.Operands
ReadObject(operand)
Next
Case "QuoteSingle", "QuoteDbl"
sbContents.Append(vbNewLine)
For Each operand As CObject In opObj.Operands
ReadObject(operand)
Next
Case Else
'Do Nothing
End Select
Case cSequenceType
Dim seqObj As CSequence = DirectCast(obj, CSequence)
For Each member As CObject In seqObj
ReadObject(member)
Next
Case cStringType
sbContents.Append(DirectCast(obj, CString).Value)
Case cCommentType, cIntegerType, cNameType, cNumberType, cRealType
'Do Nothing
Case Else
Throw New NotImplementedException(obj.GetType().AssemblyQualifiedName)
End Select
End Sub
Using pd As PdfDocument = PdfReader.Open(filePath, PdfDocumentOpenMode.ReadOnly)
For Each page As PdfPage In pd.Pages
ReadObject(ContentReader.ReadContent(page))
If maxLength > 0 And sbContents.Length >= maxLength Then
If sbContents.Length > maxLength Then
sbContents.Remove(maxLength - 1, sbContents.Length - maxLength)
End If
Exit For
End If
sbContents.Append(vbNewLine)
Next
End Using
Return sbContents.ToString
End Function
Your code is ignoring almost all operations which change the line. You do consider ' and " which most often imply a change of line but which in the wild are seldom used.
Inside a text object (BT .. ET) you, therefore, should also look out for
To interpret ', " and T* correctly, you should also look out for
If you find multiple text objects (BT .. ET .. BT .. ET), the second one is not necessarily on a new line. You should look out for the special graphics state operators between them:
Your code is ignoring all numeric arguments to the operations. You should not ignore them, especially:
0 -20 Td
starts a new line 20 units down, 20 0 Td
remains on the same line and merely starts drawing text 20 units right of the former line start.Your code is assuming the Value
of CString
instances to already contain Unicode encoded character data. This assumption in general is incorrect, the encoding used in PDF strings drawn in text drawing operations is ruled by the font. Thus, you furthermore should also look out for
For details you should first and foremost study the PDF specification ISO-32000-1, especially chapter 9 Text with a solid background from chapter 8 Graphics.