powershell pdf itext itext7

How to read the color of a line in a pdf with iText?


I am reading a PDF file with iText in PowerShell, one line at a time. For each line I read, I need to know its color, but I have no idea how to get that information.

This is the code I have so far:

Add-Type -Path "C:\Users\Ion\Documents\App\Scripts\itextsharp.dll"
$filePath="C:\Users\Scripts\Datos\ADMINISTRATIVO-AEPSA-SERV.-CENTRALES-modificado.pdf"  # File to modify
$pdf = New-Object iTextSharp.text.pdf.PdfReader -ArgumentList $filePath

$export = ""
foreach($page in 1..($pdf.NumberOfPages)){
    $export+=[iTextSharp.text.pdf.parser.PdfTextExtractor]::GetTextFromPage($pdf,$page)
    # $color =  Here I should be able to get the color of the line to process it.    
}
$pdf.Close()

$export | Out-File C:\Users\Scripts\Datos\datos.txt # The modified File

Here is the document I am working with:

https://drive.google.com/file/d/1Ix7AlE7B0ui1t0hGsAqzNrnfmduQoSc0/view?usp=share_link

I need to identify the lines that are red (or blue). I have tried methods like GetStrokeColor() with no luck, but I am not sure about the exact syntax.

Any clue? If there is another way to solve the problem outside of PowerShell or iText, it is also welcome, as long as it can be automated.

Thanks!


Solution

  • I don't use PowerShell or iText to inspect PDFs; it is usually easier to use the console. So forgive this alternative, but you asked for other ways, and this one can be combined with PowerShell. If we consolidate the dozens of text fragments into HTML spans, it becomes much easier to parse them and detect the lines that are red or blue.
    Here are the first red and blue entries as they appear in the PDF:

    q
    1 0 0 rg
    BT
    /F2 10 Tf
    1 0 0 1 56.8 641.6 Tm
    [<0E>3<02>-5<03>1(")6<0108>-9.000001<03>1<0B>6<06>-4<0819>3<12>2<06>-4<080D>4<0108>-9.000001<03>1<14>3<02>-5<03>1(#)1<03>1<0A02>-5<03>1<080F>-3<07>2<03>1<0A0F>-3<0B>6<0B>-3<06>-4(\n\r)4<010813>] TJ
    ET
    Q
    q
    .2 .6 1 rg
    BT
    /F1 10 Tf
    1 0 0 1 92.3 617.3 Tm
    [<06>-3<1B>6(+)-2(1)-1<1A>-2<1E>-4<1F>1<341A>-2(3)] TJ
    ET
    Q
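
    The two fill colours set by `rg` above line up with the hex values that appear in the HTML further down. A minimal sketch of that arithmetic, assuming each `rg` operand is an RGB component in the range 0 to 1 that is scaled to 0..255 and rounded (the exact rounding mutool applies is an assumption, and `Convert-RgToHex` is a hypothetical helper name):

```powershell
# Convert the three operands of a PDF "rg" operator (each 0..1) to a #rrggbb string.
function Convert-RgToHex {
    param([double]$R, [double]$G, [double]$B)
    $bytes = foreach ($c in $R, $G, $B) {
        # Clamp to 0..1, scale to 0..255 and round to the nearest integer.
        [int][math]::Round([math]::Min(1.0, [math]::Max(0.0, $c)) * 255)
    }
    '#{0:x2}{1:x2}{2:x2}' -f $bytes[0], $bytes[1], $bytes[2]
}

Convert-RgToHex 1 0 0      # operands of the first block
Convert-RgToHex 0.2 0.6 1  # operands of the second block
```

    With the operands from the two blocks this gives #ff0000 and #3399ff, the same colour values that show up in the HTML spans.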
    

    Here is the second blue entry as HTML (sorry, it is shown translated into my native English):

    <p style="top:228.9pt;left:92.3pt;line-height:10.0pt"><span style="font-family:Verdana,serif;font-size:10.0pt;color:#3399ff">The legal basis for this question is found in the Preamble of the </span></p>
    


    So how can that be done programmatically? Easily: run 3 lines of cmd (the only dependency is mutool.exe).

    md output
    REM We could query the page count and set a pages variable here, but this is only a proof of concept, so use the known 68.
    for /l %%i in (1,1,68) do mutool convert -o output\text%%i.html test.pdf %%i
    REM From inspecting the result we know:
    REM red  = font-family:Verdana,serif;font-size:10.0pt;color:#ff0000
    REM blue = font-family:Verdana,serif;font-size:10.0pt;color:#3399ff
    REM so we can extract each colour independently.
    for /l %%c in (1,1,68) do type output\text%%c.html |find /n "#ff0000" >>output\text%%c-red.txt
    for /l %%c in (1,1,68) do type output\text%%c.html |find /n "#3399ff" >>output\text%%c-blue.txt
    

    Result:

    [21]<p style="top:192.4pt;left:56.8pt;line-height:10.0pt"><span style="font-family:Verdana,serif;font-size:10.0pt;color:#ff0000">d) Las respuestas b) y c) son correctas.</span></p>
    [34]<p style="top:350.4pt;left:56.8pt;line-height:10.0pt"><span style="font-family:Verdana,serif;font-size:10.0pt;color:#ff0000">b) El pluralismo pol&#xed;tico.</span></p>
    [45]<p style="top:508.3pt;left:56.8pt;line-height:10.0pt"><span style="font-family:Verdana,serif;font-size:10.0pt;color:#ff0000">b) En el T&#xed;tulo Preliminar.</span><span style="font-family:Verdana,serif;font-size:10.0pt;color:#201c1d"> </span></p>
    [65]<p style="top:714.9pt;left:56.8pt;line-height:10.0pt"><span style="font-family:Verdana,serif;font-size:10.0pt;color:#ff0000">a) Que la dignidad de la persona es fundamento del orden pol&#xed;tico y de la paz social.</span></p>
    

    NOTE: there is a slight wrinkle with the third result line, which also contains another colour (a rogue single space in #201c1d) that will need to be split off.
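
    One way to split off that stray span is to match span by span instead of line by line. The following is only a sketch: it assumes every span has the shape seen in the sample output (`<span style="...;color:#rrggbb">text</span>`), and `Get-ColoredSpanText` is a hypothetical helper name.

```powershell
# Return the decoded text of every <span> whose inline style uses the given colour.
function Get-ColoredSpanText {
    param([string[]]$Line, [string]$Color)
    # Assumed span shape, taken from the sample output above.
    $pattern = '<span style="[^"]*color:' + [regex]::Escape($Color) + '">(.*?)</span>'
    foreach ($l in $Line) {
        foreach ($m in [regex]::Matches($l, $pattern)) {
            # Decode entities such as &#xed; back to their characters.
            [System.Net.WebUtility]::HtmlDecode($m.Groups[1].Value)
        }
    }
}

# The third result line, including the rogue #201c1d span.
$sample = '<p style="top:508.3pt;left:56.8pt;line-height:10.0pt"><span style="font-family:Verdana,serif;font-size:10.0pt;color:#ff0000">b) En el T&#xed;tulo Preliminar.</span><span style="font-family:Verdana,serif;font-size:10.0pt;color:#201c1d"> </span></p>'
Get-ColoredSpanText -Line $sample -Color '#ff0000'
```

    Only the red span's text is returned; the single space in #201c1d is left out because its colour does not match.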

    You can do something similar with simple text replacement in PowerShell to get your desired output, modify the commands to export only the parts you need, add other colours, and so on.

    The PDF font styles are reflected in the HTML as <b> (bold) and <i> (italic).

    File: ADMINISTRATIVO-AEPSA-SERV.-CENTRALES-modificado.pdf
    Created: 9/20/2022 1:38:39 PM
    Application: Writer
    PDF Producer: OpenOffice 4.1.5
    Fonts: 
    ArialUnicodeMS (TrueType; embedded)
    Verdana (TrueType; embedded)
    Verdana-Bold (TrueType; embedded)
    Verdana-BoldItalic (TrueType; embedded)
    Verdana-Italic (TrueType; embedded)
    

    P.S.

    To get red and blue combined, replace the last 2 lines with one:

    for /l %%c in (1,1,68) do type output\text%%c.html |findstr /n "#3399ff #ff0000" >>output\text%%c-red+blue.txt
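
    If you would rather do this filtering step in PowerShell, a rough equivalent of the combined filter could look like the sketch below. `Select-ColoredLine` is a hypothetical helper name, and note that `Select-String` matches case-insensitively by default, unlike `findstr`.

```powershell
# Keep lines that mention any of the given colours, prefixed with their 1-based line number.
function Select-ColoredLine {
    param([string[]]$Line, [string[]]$Color = @('#3399ff', '#ff0000'))
    $Line | Select-String -SimpleMatch -Pattern $Color |
        ForEach-Object { '{0}:{1}' -f $_.LineNumber, $_.Line }
}

# Usage against one converted page (file names as produced by the batch file above):
# Select-ColoredLine -Line (Get-Content 'output\text3.html') | Set-Content 'output\text3-red+blue.txt'
```

    The output uses the same `number:line` layout as `findstr /n`, so later processing can treat both the same way.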
    

    Sample of the first 4 red and blue lines on page 3; note that the second line is Verdana-Bold:

    12:<p style="top:58.8pt;left:56.8pt;line-height:10.0pt"><span style="font-family:Verdana,serif;font-size:10.0pt;color:#ff0000">d) Por ley org&#xe1;nica.</span></p>
    13:<p style="top:83.1pt;left:92.3pt;line-height:10.0pt"><b><span style="font-family:Verdana,serif;font-size:10.0pt;color:#3399ff">Normativa:</span></b></p>
    14:<p style="top:95.2pt;left:92.3pt;line-height:10.0pt"><span style="font-family:Verdana,serif;font-size:10.0pt;color:#3399ff">La fundamentaci&#xf3;n legal de esta pregunta la encontramos en el art&#xed;culo 57.5 de la  </span></p>
    15:<p style="top:107.4pt;left:92.3pt;line-height:10.0pt"><span style="font-family:Verdana,serif;font-size:10.0pt;color:#3399ff">Constituci&#xf3;n Espa&#xf1;ola, conforme al cual: </span></p>