I'm using iTextSharp to search a PDF for a keyword and extract any line(s) that contain it. What I'd like to do is extract not only the line(s) with the keyword but also the subsequent lines: the line with the keyword and the next line, the line with the keyword and the next two lines, etc.
I've been hung up on this for a while, trying arrays, hash tables, iterators... none of them are working right. Any help is appreciated. This is the basic design I've been working with:
    $reader = New-Object iTextSharp.text.pdf.pdfreader -ArgumentList anypdf.pdf
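    for ($page = 1; $page -le $reader.NumberOfPages; $page++) {
        # Turn the page's raw content stream into an array of text lines
        $lines = [char[]]$reader.GetPageContent($page) -join "" -split "`n"
        foreach ($line in $lines) {
            if ($line -match $searchstring) {
                # Strip the PDF text-showing operator and the kerning numbers
                $line = $line -replace "^\[\(|\)\]TJ$", "" -split "\)\-?\d+\.?\d*\(" -join ""
                # Unescape backslash-escaped characters ('$1' is the captured character)
                $line = $line -replace "\\([\S])", '$1'
                Write-Host $line
            }
        }
    }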
I can't take credit for the logic that strips out the unwanted characters from the PDF, and that may be why I haven't figured this out yet. The above code gets me any line that contains the keyword. The problem seems to be the PDF is split into pages and those pages are split into lines (which are each an array of characters). It would be nice and efficient if I could simply create a hash table of every line in the PDF from the start.
That's what Select-String was invented for.
    for ($page = 1; $page -le $reader.NumberOfPages; $page++) {
        [char[]]$reader.GetPageContent($page) -join "" -split "`n" |
            Select-String $searchstring -Context 0,2 |
            ForEach-Object {
                # $_.Line is the matching line as a plain string
                $_.Line -replace "^\[\(|\)\]TJ$", "" `
                    -split "\)\-?\d+\.?\d*\(" -join "" `
                    -replace "\\([\S])", '$1'
            }
    }
I don't quite understand all the splitting and joining and replacing you're doing there, so you may need to adjust that.
Also, the above doesn't include the after context, since I wouldn't know where you want it to go. It can be accessed via $_.Context.PostContext.
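As a rough sketch of one way to emit the after context together with the match, the snippet below puts each matching line directly in front of its PostContext lines. The sample $lines array and the 'Invoice' keyword are made up for illustration and stand in for lines that have already been pulled out of the PDF and cleaned up; whether you want the context lines cleaned the same way as the match is an assumption you'd have to adjust.

```powershell
# Hypothetical sample data standing in for the already-cleaned lines of one PDF page.
$lines = @(
    'Invoice number 1001'
    'Customer: Acme'
    'Total due: 42.00'
    'Invoice number 1002'
    'Customer: Globex'
)
$searchstring = 'Invoice'   # hypothetical keyword

$results = $lines |
    Select-String -Pattern $searchstring -Context 0,2 |
    ForEach-Object {
        # The matching line itself, followed by up to two lines after it.
        # PostContext is shorter when the input ends before two more lines arrive.
        @($_.Line) + @($_.Context.PostContext)
    }

$results
```

Changing the second number in -Context 0,2 controls how many following lines come back with each match (0,1 for the next line only, 0,3 for the next three, and so on).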