
Find specific fields in a PDF using PowerShell, Regex, itextsharp.dll


I'm very new to regex, but I have spent the last few hours trying to parse some data from a PDF using PowerShell and itextsharp.dll. I was going to post in the iTextSharp forums, but I didn't see a place to ask for help there, just a collection of how-tos for people who already understand regex well.

The PDF table looks like this: [image: screenshot of the PDF table]

The itextsharp.dll output looks like this:

Selection Criteria Report parameters
Select all Bottles where
Date Loaded - Date/Time (Bottle) is after or equal to '11/20/2015 15:50'
AND
Date Loaded - Date/Time (Bottle) is before or equal to '11/20/2015
16:10'
N/A
Unit # Status Determined Bottle ID Time to Find Cell
=W00000000000001 Negative 11/25/2015 16:08 AAAACNSJ 5 2D55
=W00000000000002 Negative 11/25/2015 16:08 AAAACNSA 5 2D56
1291231 Negative 11/25/2015 16:08 AAAACNB 5 2D57
=W00000000000003 Positive 11/25/2015 16:08 AAAACNS9 5 2D58
1981231 Negative 11/25/2015 16:09 AAAACNSG 5 2D59
=W00000000000004 Negative 11/25/2015 16:10 AAAACNS7 5 2D60
Report
Reviewed By: Printed for manual signature
Page 1 of 1 11/25/2015 16:15

I've been using the following code with various regular expressions to try to extract only the table rows and assign each column to a variable. I've omitted most of what I tried because there has been so much of it, and the layout of the data makes it hard for me to tell what I'm doing wrong.

# $reader is assumed to be an iTextSharp.text.pdf.PdfReader created earlier (not shown).
for ($page = 1; $page -le $reader.NumberOfPages; $page++)
{
    # Extract the text of the current page.
    $strategy = New-Object 'iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy'
    $currentText = [iTextSharp.text.pdf.parser.PdfTextExtractor]::GetTextFromPage($reader, $page, $strategy)

    # Round-trip through the default encoding (kept from the original snippet).
    # Assign instead of appending so earlier pages are not processed again.
    $text = [System.Text.Encoding]::UTF8.GetString([System.Text.Encoding]::Convert([System.Text.Encoding]::Default, [System.Text.Encoding]::UTF8, [System.Text.Encoding]::Default.GetBytes($currentText)))

    # Check every line, including those that follow an empty line.
    foreach ($line in ($text -split '\r?\n')) {
        # Invisible zero-width characters removed from the pattern; \d+ now sits inside the group.
        if ($line -match '^(?<unit_id>=?\w+)\s+(?<status>\w+)\s+(?<determined>\d{2}\/\d{2}\/\d{4}\s+\d{2}:\d{2})\s+(?<bottle_id>\w+)\s+(?<time_to_find>\d+)\s+(?<cell>\w+)$') {
            Write-Host $line
        }
    }
}
$reader.Close()

Is there anyone out there that could assist me with getting all these columns set to variables properly? Any help would be greatly appreciated. Thanks!


Solution

  • Here is a sample regex that should parse a single row of the table:

    $text = '=W03651532551000 Negative 11/25/2015 16:08 PAGYCNQ6 5 2D56'
    $text -match '^(?<unit_id>=?\w+)\s+(?<status>\w+)\s+(?<determined>[\/\d\s:]+)\s+(?<bottle_id>\w+)\s+(?<time_to_find>\d+)\s+(?<cell>\w+)$'
    $matches
    

    Output:

    Name                           Value
    ----                           -----
    determined                     11/25/2015 16:08
    cell                           2D56
    status                         Negative
    bottle_id                      PAGYCNQ6
    time_to_find                   5
    unit_id                        =W03651532551000
    0                              =W03651532551000 Negative 11/25/2015 16:08 PAGYCNQ6 5 2D56
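
    One detail worth checking: the pattern above puts the quantifier inside the group, `(?<time_to_find>\d+)`, whereas the pattern in the question has it outside, `(?<time_to_find>\d)+`. The sketch below compares the two on the value 15, which is a made-up input and not a value from the report (every row there has a single-digit time, so the difference would not show up).

    ```powershell
    # Quantifier outside the group: the group repeats and keeps only its last digit.
    $null = '15' -match '^(?<time_to_find>\d)+$'
    $lastDigitOnly = $Matches['time_to_find']

    # Quantifier inside the group: the whole run of digits is captured.
    $null = '15' -match '^(?<time_to_find>\d+)$'
    $allDigits = $Matches['time_to_find']

    "outside: $lastDigitOnly, inside: $allDigits"   # outside: 5, inside: 15
    ```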
    

    And here is a more complete example that splits multi-line text and builds one object per matching line:

    $objcol = @()
    $text = "=W03651532551000 Negative 11/25/2015 16:08 PAGYCNQ6 5 2D56`nLW03651532551000 Positive 11/25/2015 16:08 PAGYCNQ6 5 2D56"
    $text.Split("`n") | Where-Object {
        # -match fills the automatic $matches variable with the named groups
        $_ -match '(?<unit_id>=?\w+)\s+(?<status>\w+)\s+(?<determined>\d{2}\/\d{2}\/\d{4}\s+\d{2}:\d{2})\s+(?<bottle_id>\w+)\s+(?<time_to_find>\d+)\s+(?<cell>\w+)'
    } | ForEach-Object {
        # Build one object per matching line from the captured groups
        $obj = New-Object PSObject -Property @{
            unitId = $matches['unit_id']; status = $matches['status']
            Determined = $matches['determined']; bottleId = $matches['bottle_id']
            timeToFind = $matches['time_to_find']
        }
        $objcol += $obj
    }
    Write-Output $objcol
    

    The result:

    bottleId   : PAGYCNQ6
    timeToFind : 5
    Determined : 11/25/2015 16:08
    unitId     : =W03651532551000
    status     : Negative
    
    bottleId   : PAGYCNQ6
    timeToFind : 5
    Determined : 11/25/2015 16:08
    unitId     : LW03651532551000
    status     : Positive
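
As a rough sketch of how the same approach could be applied to the extracted text from the question, the snippet below runs an anchored version of the pattern over a few of those lines, skips the header and footer lines, and also keeps the cell column. The variable names (`$sampleLines`, `$rowPattern`, `$rows`) are illustrative. The conversion to `[datetime]` and `[int]` is an assumption about what is wanted, and the `MM/dd/yyyy HH:mm` format is inferred from sample values such as 11/25/2015 16:08. `[pscustomobject]` requires PowerShell 3.0 or later.

```powershell
# Lines taken from the extracted text in the question (header, rows, footer).
$sampleLines = @(
    'Unit # Status Determined Bottle ID Time to Find Cell'
    '=W00000000000001 Negative 11/25/2015 16:08 AAAACNSJ 5 2D55'
    '1291231 Negative 11/25/2015 16:08 AAAACNB 5 2D57'
    '=W00000000000003 Positive 11/25/2015 16:08 AAAACNS9 5 2D58'
    'Report'
    'Page 1 of 1 11/25/2015 16:15'
)

# Anchored with ^ and $ so that only complete table rows match.
$rowPattern = '^(?<unit_id>=?\w+)\s+(?<status>\w+)\s+(?<determined>\d{2}/\d{2}/\d{4}\s+\d{2}:\d{2})\s+(?<bottle_id>\w+)\s+(?<time_to_find>\d+)\s+(?<cell>\w+)$'

$rows = foreach ($line in $sampleLines) {
    # Trim removes a trailing carriage return or stray spaces before matching.
    if ($line.Trim() -match $rowPattern) {
        # Collapse any run of whitespace between date and time to one space.
        $stamp = $Matches['determined'] -replace '\s+', ' '
        [pscustomobject]@{
            UnitId     = $Matches['unit_id']
            Status     = $Matches['status']
            # Assumed format: month/day/year with a 24-hour time.
            Determined = [datetime]::ParseExact($stamp, 'MM/dd/yyyy HH:mm', [cultureinfo]::InvariantCulture)
            BottleId   = $Matches['bottle_id']
            TimeToFind = [int]$Matches['time_to_find']
            Cell       = $Matches['cell']
        }
    }
}

$rows | Format-Table -AutoSize
```

Each element of `$rows` then exposes the columns as properties, so something like `$rows | Where-Object Status -eq 'Positive'` can filter on them.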