I want to extract the contents of text fields from PDF files and bring them into my WinForms project. While searching I found references to iTextSharp, but then saw that it has been replaced by iText7, and everything I read refers only to its use in C#. My WinForms project is in VB. Any pointers on the best way to get that data into my project would be much appreciated.
To extract text from a PDF file using itext7, try the following:
Prerequisite: Download/install the NuGet package itext7
Add the following Imports statements:
Imports iText.Kernel.Pdf
Imports iText.Kernel.Pdf.Canvas.Parser.Listener
Imports iText.Kernel.Pdf.Canvas.Parser.PdfTextExtractor
GetTextFromPdf:
Public Function GetTextFromPdf(filename As String) As String
    Dim sb As System.Text.StringBuilder = New System.Text.StringBuilder()

    Using doc As PdfDocument = New PdfDocument(New PdfReader(filename))
        'Dim strategy As LocationTextExtractionStrategy = New LocationTextExtractionStrategy()

        'page numbers are 1-based
        For i As Integer = 1 To doc.GetNumberOfPages()
            Dim page = doc.GetPage(i)

            'Dim text = iText.Kernel.Pdf.Canvas.Parser.PdfTextExtractor.GetTextFromPage(page, strategy)
            Dim text = GetTextFromPage(page)
            sb.AppendLine(text)
        Next
    End Using

    Return sb.ToString()
End Function
The code for GetTextFromPdf is adapted from here.
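As a rough sketch of how the function might be called from a WinForms form: the control names Button1 and TextBox1 below are hypothetical (a button and a multiline text box added in the designer), and GetTextFromPdf is assumed to be defined in the same form class.

```vb
Imports System
Imports System.Windows.Forms

Public Class Form1
    'Button1 and TextBox1 are hypothetical controls created in the designer;
    'GetTextFromPdf is assumed to be a member of this class
    Private Sub Button1_Click(sender As Object, e As EventArgs) Handles Button1.Click
        Using dlg As New OpenFileDialog()
            dlg.Filter = "PDF files (*.pdf)|*.pdf"

            'only read the file if the user confirmed the dialog
            If dlg.ShowDialog(Me) = DialogResult.OK Then
                TextBox1.Text = GetTextFromPdf(dlg.FileName)
            End If
        End Using
    End Sub
End Class
```

Letting the user pick the file with OpenFileDialog avoids hard-coding a path.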
Update:
The code below shows how to read the field names and field values from an AcroForm in a PDF document:
Add the following Imports statements:
Imports iText.Forms
Imports iText.Kernel.Pdf
GetTextFromPdfFields:
Public Function GetTextFromPdfFields(filename As String) As String
    Dim sb As System.Text.StringBuilder = New System.Text.StringBuilder()

    'open the document for reading
    Using doc As PdfDocument = New PdfDocument(New PdfReader(filename))
        'get AcroForm from document; False = don't create one if the PDF has none
        Dim form As PdfAcroForm = PdfAcroForm.GetAcroForm(doc, False)

        'no form in this PDF, so there are no fields to read
        If form Is Nothing Then Return sb.ToString()

        'get form fields
        Dim fieldDict As IDictionary(Of String, Fields.PdfFormField) = form.GetFormFields()

        'loop through form fields
        For Each kvp As KeyValuePair(Of String, Fields.PdfFormField) In fieldDict
            'kvp.Value is the field itself, so no second lookup by name is needed
            Dim field As Fields.PdfFormField = kvp.Value
            Dim fieldType As PdfName = field.GetFormType()
            Dim fieldName As PdfString = field.GetFieldName()
            Dim fieldValue As String = field.GetValueAsString()

            If fieldName IsNot Nothing Then
                'append data to instance of StringBuilder
                sb.AppendLine("Type: " & fieldType?.ToString() & " FieldName: " & fieldName.ToString() & " Value: " & fieldValue)
            End If
        Next
    End Using

    Return sb.ToString()
End Function
Note: The code for GetTextFromPdfFields is adapted from here.
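If the goal is to put individual field values into controls, a single formatted string is awkward to work with. The variant below is only a sketch, assuming iText 7.x where PdfAcroForm.GetFormFields() is available (newer major versions may name this method differently). The module name PdfFieldReader and the function name GetPdfFieldValues are made up for illustration. It returns the values keyed by field name.

```vb
Imports System.Collections.Generic
Imports iText.Forms
Imports iText.Forms.Fields
Imports iText.Kernel.Pdf

'PdfFieldReader and GetPdfFieldValues are hypothetical names for this sketch
Public Module PdfFieldReader
    Public Function GetPdfFieldValues(filename As String) As Dictionary(Of String, String)
        Dim values As New Dictionary(Of String, String)()

        Using doc As New PdfDocument(New PdfReader(filename))
            'False = don't create an AcroForm if the PDF has none
            Dim form As PdfAcroForm = PdfAcroForm.GetAcroForm(doc, False)

            'a PDF without a form yields an empty dictionary
            If form Is Nothing Then Return values

            For Each kvp As KeyValuePair(Of String, PdfFormField) In form.GetFormFields()
                'key is the field name, value is the field's current content as text
                values(kvp.Key) = kvp.Value.GetValueAsString()
            Next
        End Using

        Return values
    End Function
End Module
```

A caller could then look up one field at a time, for example with values.TryGetValue("CustomerName", result), where "CustomerName" stands in for whatever the field is actually called in the PDF.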