I am reading a PDF file with iText (iTextSharp) in PowerShell. I read each line and need to know the color of the line I am reading, but I have no idea how to get that information.
This is the code I have so far:
Add-Type -Path "C:\Users\Ion\Documents\App\Scripts\itextsharp.dll"
$filePath="C:\Users\Scripts\Datos\ADMINISTRATIVO-AEPSA-SERV.-CENTRALES-modificado.pdf" # File to modify
$pdf = New-Object iTextSharp.text.pdf.PdfReader -ArgumentList $filePath
$export = ""
foreach ($page in 1..($pdf.NumberOfPages)) {
    $export += [iTextSharp.text.pdf.parser.PdfTextExtractor]::GetTextFromPage($pdf, $page)
    # $color = Here I should be able to get the color of the line to process it.
}
$pdf.Close()
$export | Out-File C:\Users\Scripts\Datos\datos.txt # The modified File
Here is the document I am working with:
https://drive.google.com/file/d/1Ix7AlE7B0ui1t0hGsAqzNrnfmduQoSc0/view?usp=share_link
I need to find the lines that are red (or blue). I have tried methods like GetStrokeColor() with no luck, but I am not sure about the exact syntax.
Any clue? Another way to solve the problem outside PowerShell or iText is also welcome, as long as it can be automated.
Thanks!
I don't use PowerShell or iText to review PDFs; it is usually easier to use the console. So forgive this alternative, but you asked for other ways, and this one can be combined with PowerShell. If we consolidate the dozens of text fragments into HTML spans, it is much easier to parse them and detect the lines that are red or blue.
Here are the first red entry and the first blue entry as raw PDF content-stream operators:
q
1 0 0 rg
BT
/F2 10 Tf
1 0 0 1 56.8 641.6 Tm
[<0E>3<02>-5<03>1(")6<0108>-9.000001<03>1<0B>6<06>-4<0819>3<12>2<06>-4<080D>4<0108>-9.000001<03>1<14>3<02>-5<03>1(#)1<03>1<0A02>-5<03>1<080F>-3<07>2<03>1<0A0F>-3<0B>6<0B>-3<06>-4(\n\r)4<010813>] TJ
ET
Q
q
.2 .6 1 rg
BT
/F1 10 Tf
1 0 0 1 92.3 617.3 Tm
[<06>-3<1B>6(+)-2(1)-1<1A>-2<1E>-4<1F>1<341A>-2(3)] TJ
ET
Q
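To connect the two views: the rg operator sets the non-stroking (fill) colour as three RGB components between 0 and 1, and text is normally painted with the fill colour, which may be why stroke-colour methods such as GetStrokeColor() are of no help here. Below is a minimal Python sketch that converts those operands into the hex notation used in HTML/CSS. The helper name rg_to_hex is hypothetical and not part of any library.

```python
def rg_to_hex(operator_line: str) -> str:
    """Convert a PDF `rg` operator line such as '.2 .6 1 rg' to '#rrggbb'.

    Hypothetical helper for illustration only; it assumes a DeviceRGB fill
    colour with exactly three numeric operands.
    """
    parts = operator_line.split()
    # Drop the trailing operator name if it is present.
    if parts and parts[-1] == "rg":
        parts = parts[:-1]
    if len(parts) != 3:
        raise ValueError(f"expected three colour operands, got {parts!r}")
    channels = []
    for part in parts:
        value = round(float(part) * 255)          # scale 0..1 to 0..255
        channels.append(max(0, min(255, value)))  # clamp out-of-range input
    return "#" + "".join(f"{channel:02x}" for channel in channels)


print(rg_to_hex("1 0 0 rg"))    # the red entry above
print(rg_to_hex(".2 .6 1 rg"))  # the blue entry above
```

For the two entries above this gives #ff0000 for the red line and #3399ff for the blue one.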
Here is the second blue entry as HTML (sorry, it is shown in my native English):
<p style="top:228.9pt;left:92.3pt;line-height:10.0pt"><span style="font-family:Verdana,serif;font-size:10.0pt;color:#3399ff" _msthash="123188" _msttexthash="2418364">The legal basis for this question is found in the Preamble of the </span></p>
So how can that be done programmatically? Easily: run a few lines of cmd in a batch file (the only dependency is mutool.exe):
md output
REM We could query the number of pages and store it in a variable here, but this is just a proof of concept, so use the known 68
for /l %%i in (1,1,68) do mutool convert -o output\text%%i.html test.pdf %%i
REM from inspection of result we know
REM red = font-family:Verdana,serif;font-size:10.0pt;color:#ff0000
REM blue = font-family:Verdana,serif;font-size:10.0pt;color:#3399ff
REM so we can extract those independently
for /l %%c in (1,1,68) do type output\text%%c.html |find /n "#ff0000" >>output\text%%c-red.txt
for /l %%c in (1,1,68) do type output\text%%c.html |find /n "#3399ff" >>output\text%%c-blue.txt
Sample of the red lines found (the number in brackets is the line number within the HTML file):
[21]<p style="top:192.4pt;left:56.8pt;line-height:10.0pt"><span style="font-family:Verdana,serif;font-size:10.0pt;color:#ff0000">d) Las respuestas b) y c) son correctas.</span></p>
[34]<p style="top:350.4pt;left:56.8pt;line-height:10.0pt"><span style="font-family:Verdana,serif;font-size:10.0pt;color:#ff0000">b) El pluralismo político.</span></p>
[45]<p style="top:508.3pt;left:56.8pt;line-height:10.0pt"><span style="font-family:Verdana,serif;font-size:10.0pt;color:#ff0000">b) En el Título Preliminar.</span><span style="font-family:Verdana,serif;font-size:10.0pt;color:#201c1d"> </span></p>
[65]<p style="top:714.9pt;left:56.8pt;line-height:10.0pt"><span style="font-family:Verdana,serif;font-size:10.0pt;color:#ff0000">a) Que la dignidad de la persona es fundamento del orden político y de la paz social.</span></p>
NOTE: there is a slight wrinkle with the third line above ([45]): it also contains another colour (a rogue single space in #201c1d) that will need to be split off.
You can do something similar with simple text replacement in PowerShell to get your desired output, modify the commands to export only the parts you need, or add other colours, etc.
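As one way to handle that wrinkle, here is a minimal Python sketch that keeps only the text of spans in the wanted colours and ignores everything else, including the rogue #201c1d space. It uses only the standard library. The names ColourSpanParser and coloured_text are hypothetical, and the colour values are simply the ones observed in this particular document.

```python
import re
from html.parser import HTMLParser

# Colours observed in the mutool output for this document (an assumption
# for any other PDF; inspect the generated HTML first).
WANTED = {"#ff0000": "red", "#3399ff": "blue"}

# Match the `color:` property only, not e.g. `background-color:`.
COLOUR_RE = re.compile(r"(?:^|;)\s*color:\s*(#[0-9a-fA-F]{6})")


class ColourSpanParser(HTMLParser):
    """Collect (colour name, text) pairs for spans in a wanted colour."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.current = None  # colour name of the span being read, if wanted
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        style = dict(attrs).get("style") or ""
        match = COLOUR_RE.search(style)
        self.current = WANTED.get(match.group(1).lower()) if match else None

    def handle_endtag(self, tag):
        if tag == "span":
            self.current = None

    def handle_data(self, data):
        # Skip whitespace-only runs and text outside a wanted span.
        if self.current and data.strip():
            self.found.append((self.current, data.strip()))


def coloured_text(html: str):
    parser = ColourSpanParser()
    parser.feed(html)
    parser.close()
    return parser.found


sample = (
    '<p style="top:508.3pt;left:56.8pt;line-height:10.0pt">'
    '<span style="font-family:Verdana,serif;font-size:10.0pt;color:#ff0000">'
    'b) En el Título Preliminar.</span>'
    '<span style="font-family:Verdana,serif;font-size:10.0pt;color:#201c1d"> </span></p>'
)
print(coloured_text(sample))
```

In practice you would read each output\textN.html file produced by mutool and pass its contents to coloured_text, instead of the hard-coded sample line.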
The PDF font styles are reflected in the HTML as <b> = bold and <i> = italic.
File: ADMINISTRATIVO-AEPSA-SERV.-CENTRALES-modificado.pdf
Created: 9/20/2022 1:38:39 PM
Application: Writer
PDF Producer: OpenOffice 4.1.5
Fonts:
ArialUnicodeMS (TrueType; embedded)
Verdana (TrueType; embedded)
Verdana-Bold (TrueType; embedded)
Verdana-BoldItalic (TrueType; embedded)
Verdana-Italic (TrueType; embedded)
For red and blue combined, replace the last two lines with this one:
for /l %%c in (1,1,68) do type output\text%%c.html |findstr /n "#3399ff #ff0000" >>output\text%%c-red+blue.txt
Sample of the first four red and blue lines on page 3; note that the second line is Verdana-Bold (wrapped in <b>):
12:<p style="top:58.8pt;left:56.8pt;line-height:10.0pt"><span style="font-family:Verdana,serif;font-size:10.0pt;color:#ff0000">d) Por ley orgánica.</span></p>
13:<p style="top:83.1pt;left:92.3pt;line-height:10.0pt"><b><span style="font-family:Verdana,serif;font-size:10.0pt;color:#3399ff">Normativa:</span></b></p>
14:<p style="top:95.2pt;left:92.3pt;line-height:10.0pt"><span style="font-family:Verdana,serif;font-size:10.0pt;color:#3399ff">La fundamentación legal de esta pregunta la encontramos en el artículo 57.5 de la </span></p>
15:<p style="top:107.4pt;left:92.3pt;line-height:10.0pt"><span style="font-family:Verdana,serif;font-size:10.0pt;color:#3399ff">Constitución Española, conforme al cual: </span></p>