vb.net pdf split itext

String split for detection of a text page change from PDF


I'm trying to analyse a PDF document with the iTextSharp library. The end goal is to read all of the text and split it into individual lines.

To do this, I call Split on the extracted text, which I hold as one complete string in a variable, like this:

 Dim RigheTesto As String()
 RigheTesto = testoEstrapolato.Split({vbCrLf, vbCr, vbLf}, StringSplitOptions.RemoveEmptyEntries)

The Split function works fine and I get a string array with one element for every line of the original file, each of the form "Data type: value".

However, when the text crosses a page break in the original PDF, Split does not recognise the first line of the new page as a separate line and joins it to the last line of the previous page.

Do you know how to solve this problem, please?

Thanks for your time!


Solution

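  • The following shows how to extract text from a PDF file using the NuGet package iTextSharp (tested with v5.5.13.2). It reads the text one page at a time and splits each page separately, so the last line of one page cannot be merged with the first line of the next.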

    Download/install NuGet package iTextSharp

    Create a class (name: PdfPageInfo.vb)

    Public Class PdfPageInfo
        Public Property PageNumber As Integer
        Public Property Lines As List(Of String) = New List(Of String)
    End Class
    
    

    Create a module (name: HelperiTextSharp.vb)

    Imports System.Collections.Generic
    Imports System.Linq
    Imports iTextSharp.text.pdf
    Imports iTextSharp.text.pdf.parser
    
    Module HelperiTextSharp
        Public Function ExtractText(filename As String) As List(Of PdfPageInfo)
            Dim pageInfoList As List(Of PdfPageInfo) = New List(Of PdfPageInfo)
    
            Using reader As PdfReader = New PdfReader(filename)
                For i As Integer = 1 To reader.NumberOfPages Step 1
    
                    'create new instance
                    Dim pageInfo As PdfPageInfo = New PdfPageInfo()
    
                    'set value
                    pageInfo.PageNumber = i
    
                    'get text from PDF page
                    Dim pageText As String = PdfTextExtractor.GetTextFromPage(reader, i)
    
                    'split on newline and set value
                    pageInfo.Lines = pageText.Split(New String() {vbCrLf, vbCr, vbLf}, StringSplitOptions.RemoveEmptyEntries).ToList()
    
                    'add this page to the result list
                    pageInfoList.Add(pageInfo)
                Next
            End Using
    
            Return pageInfoList
        End Function
    End Module
    

    Usage:

    Dim ofd As OpenFileDialog = New OpenFileDialog()
    ofd.Filter = "PDF files(*.pdf)|*.pdf"
    
    If ofd.ShowDialog = DialogResult.OK Then
        Dim pdfPageInfoList As List(Of PdfPageInfo) = HelperiTextSharp.ExtractText(ofd.FileName)
    
        For Each pInfo As PdfPageInfo In pdfPageInfoList
            Debug.WriteLine("Page Number: " & pInfo.PageNumber.ToString())
    
            For i As Integer = 0 To pInfo.Lines.Count - 1 Step 1
                Debug.WriteLine("[" & i & "]: " & pInfo.Lines(i))
            Next
    
            Debug.WriteLine("---------------------------------" & vbCrLf)
        Next
    End If
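
    If a single flat array of lines is still wanted, as in the question, the per-page lists can be combined afterwards. The following is only a minimal sketch: the module name PageLineHelper and the function FlattenLines are hypothetical names introduced here, and PdfPageInfo is repeated only so the snippet compiles on its own. It assumes the pages are already in page order, which is how ExtractText above returns them.

    ```vbnet
    Imports System
    Imports System.Collections.Generic

    'same shape as the PdfPageInfo class above, repeated so this sketch is self-contained
    Public Class PdfPageInfo
        Public Property PageNumber As Integer
        Public Property Lines As List(Of String) = New List(Of String)
    End Class

    'hypothetical helper module (not part of iTextSharp)
    Public Module PageLineHelper
        'combine the lines of every page into one array, keeping page order
        Public Function FlattenLines(pages As IEnumerable(Of PdfPageInfo)) As String()
            Dim allLines As List(Of String) = New List(Of String)

            For Each pInfo As PdfPageInfo In pages
                'each page was split separately, so the last line of one page
                'and the first line of the next remain separate entries
                allLines.AddRange(pInfo.Lines)
            Next

            Return allLines.ToArray()
        End Function
    End Module
    ```

    With this in place, something like Dim RigheTesto As String() = PageLineHelper.FlattenLines(HelperiTextSharp.ExtractText(ofd.FileName)) should give an array similar to the one in the question, with the lines around each page break kept apart.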
    
