itextpdf-parsing

PdfReaderContentParser.ProcessContent returns whitespace for clear text


I'd like to parse a PDF that contains both binary and clear text data. When I try to do this with PdfReaderContentParser, the GetResultantText method returns the right text for the binary content but only whitespace for the clear text content. Here is the code I use:

        byte[] binaryPdf = File.ReadAllBytes(this.fileName);
        reader = new PdfReader(binaryPdf);

        PdfReaderContentParser parser = new PdfReaderContentParser(reader);

        for (int i = 1; i <= reader.NumberOfPages; i++)
        {
            SimpleTextExtractionStrategy simpleStrategy = parser.ProcessContent(i, new SimpleTextExtractionStrategy());
            string contentText = simpleStrategy.GetResultantText();

            // Do something with the contentText
            // ...
        }

Any idea how to get all content?


Solution

    Overview

    In a comment the OP clarified which texts he was missing in his extracted text:

    Basically for all descriptions on the left-hand side (e.g. Lifting moment) I get whitespaces instead of the actual text.

    The reason is fairly simple: on most of the left-hand side, the page content contains only spaces (if anything at all). The labels you see are actually read-only form fields, which are not part of the page content stream that the text extraction processes.

    For example the "Lifting moment" is the value of the form field 13B141032.
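    To check which labels live in form fields rather than in the page content, you can list the field names and values. The following is only a minimal sketch, assuming iTextSharp 5.x; the class name FormFieldDump and the method name PrintFields are hypothetical, chosen for illustration.

    ```csharp
    using System;
    using iTextSharp.text.pdf;

    public static class FormFieldDump
    {
        // Prints every AcroForm field name together with its current value.
        // Hypothetical helper; "path" is the PDF to inspect.
        public static void PrintFields(string path)
        {
            using (PdfReader reader = new PdfReader(path))
            {
                AcroFields form = reader.AcroFields;

                // Fields maps each fully qualified field name to its item.
                foreach (string name in form.Fields.Keys)
                {
                    // GetField returns the field value as a string.
                    Console.WriteLine("{0} = \"{1}\"", name, form.GetField(name));
                }
            }
        }
    }
    ```

    If the analysis here is right, a label such as "Lifting moment" should show up as the value of one of the listed fields.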

    If you want text extraction to include these fields, too, you should consider flattening the document in a first step (moving the field appearances into the regular page content stream) and extracting text from this flattened document.

    Document analysis

    It looks like the major part of the internationalization of the specification labels has been done using form fields.

    For an overview, I separated the original document into its regular page content and its form fields.

    [Figures: original document; page content; page fields]

    There indeed are several strings of spaces in the page content under the form fields.
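    A view showing only the page content can be approximated in code. This is a sketch under assumptions, not necessarily how the separation described above was produced: it assumes iTextSharp 5.x, and the class name, method name and both path parameters are hypothetical. It removes the form fields from the reader and writes the remaining document to a new file.

    ```csharp
    using System.IO;
    using iTextSharp.text.pdf;

    public static class PageContentOnly
    {
        // Writes a copy of the PDF without its form fields, so that only
        // the regular page content remains. Hypothetical helper.
        public static void WriteWithoutFields(string inputPath, string outputPath)
        {
            using (PdfReader reader = new PdfReader(inputPath))
            {
                // Drops the field widgets and the AcroForm entry.
                reader.RemoveFields();

                using (FileStream output = File.Create(outputPath))
                using (PdfStamper stamper = new PdfStamper(reader, output))
                {
                    // No further changes; disposing the stamper writes the file.
                }
            }
        }
    }
    ```

    Extracting text from the resulting file should give the same spaces-only result for the labels as extracting from the original.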

    I would assume that there once was an earlier version of that document (or a template for it) which contained those labels (maybe in only one language or probably two) as page content.

    Then there was a task of more dynamic internationalization, so someone replaced the existing labels in the page content by spaces and added new internationalized labels as read-only form-fields, probably because form fields are easier to manipulate.

    Considering that the original labels seem to have been replaced by an equal number of spaces, though, one might speculate that there even is another program manipulating the page stream of this and similar documents at hard coded offsets, and to not break this program in the course of internationalization the actual labels had to be created outside the page content. Stranger things have happened...

    Flatten and extract

    As mentioned above, if you want text extraction to include these fields, too, you should consider flattening the document in a first step (moving the field appearances into the regular page content stream) and extracting text from this flattened document. This can be done like this:

    [Test]
    public void ExtractFlattenedTextTestSeeb()
    {
        FileInfo file = new FileInfo(@"PATH_TO_FILE\41851208.pdf");
        Console.Out.Write("41851208.pdf, flattened before extraction\n\n");
    
        using (MemoryStream memStream = new MemoryStream())
        {
            using (PdfReader readerOrig = new PdfReader(file.FullName))
            using (PdfStamper stamper = new PdfStamper(readerOrig, memStream))
            {
                stamper.Writer.CloseStream = false;
                stamper.FormFlattening = true;
            }
            memStream.Position = 0;
            using (PdfReader readerFlat = new PdfReader(memStream))
            {
                PdfReaderContentParser parser = new PdfReaderContentParser(readerFlat);
    
                for (int i = 1; i <= readerFlat.NumberOfPages; i++)
                {
                    SimpleTextExtractionStrategy simpleStrategy = parser.ProcessContent(i, new SimpleTextExtractionStrategy());
                    string contentText = simpleStrategy.GetResultantText();
    
                    Console.Write("Page {0}:\n\n{1}\n\n", i, contentText);
                }
            }
        }
    }
    

    The resulting standard output:

    41851208.pdf, flattened before extraction
    
    Page 1:
    
    90–120 l/min 
    (23.8–31.7 US gal./min) 
    60 kg 
    (132 lbs) 
    115 kg 
    (254 lbs) 
    350 l 
    (92.5 US gal.) 
    100 kg 105 kg 
    (220 lbs) (231 kg) 
    100 kg 
    (220 lbs) 
    250 l 300 l 
    (66.0 US gal.) (79.3 US gal.) 
    90 kg 
    (198 lbs) 
    180 l 
    (47.6 US gal.) 
    5305kg 
    (11695 lbs) 
    5265kg 
    (11607 lbs) 
    5395kg 
    (11894 lbs) 
    5205kg 
    (11475 lbs) 
    5010kg 
    (11045 lbs) 
    4780kg 
    (10538 lbs) 
    4470kg 
    (9854 lbs) 
    4190kg 
    (9237 lbs) 
    3930kg 
    (8664 lbs) 
    5215kg 
    (11497 lbs) 
    5045kg 
    (11122 lbs) 
    4860kg 
    (10714 lbs) 
    4650kg 
    (10251 lbs) 
    4350kg 
    (9590 lbs) 
    4100kg 
    (9039 lbs) 
    3850kg 
    (8488 lbs) 
    25.2 m 
    (82’ 8") 
    23.2 m 
    (76’ 1") 
    21.0 m 
    (68’ 11") 
    18.7 m 
    (61’ 4") 
    16.4 m 
    (53’ 10") 
    14.1 m 
    (46’ 3") 
    11.8 m 
    (38’ 9") 
    9.7 m 
    (31’ 10") 
    7.7 m 
    (25’ 3") 
    36.5 MPa (365 bar) 
    (5293 psi) 
    endlos 
    endless 
    sans finite 
    25.2 m 
    31.2 m 
    (82’ 8") 
    (102’ 4") 
    21.0 m 
    (68’ 11") 
    14900kg 
    (32848 lbs) 
    403.2 kNm (41.1 mt) 
    (297270 ft.lbs) 
    49.1 kNm (5.0 mt) 
    PK 42002–SH A–G 
    (36210 ft.lbs) 
    37.3 kNm (3.8 mt) 
    PK 42002–SH A–C 
    (27510 ft.lbs) 
    
    1GETR 2GETR
    PK 42002–SH A – C
    KT250 KT300 KT350 KT180
    
    
    
    2GETR STZY
    
    
    
    +V1
    +V2
    +2/4
    7(F) 8(G) 6(E) 5(D) 4(C) 3(B) 2(A)
    
    
    
    +V1
    +V2
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    (S410–SK–D)
    DTS410SHC/03
    0100
    11/2010
    
    
    
    PK 42002–SH
    Type Model Modell
    Page Page Seite
    Chapitre Chapter Kapitel
    Edition Edition Ausgabe
    
    
    
    Öltank
    Mehrgewicht: 
    Alle Gewichtsangaben ohne Aufbauzubehör,Zusatzgeräte und Öl. 
    Hydr. Ausschübe:
    Max. Reichweite + Fly-Jib:
    Max. Reichweite: 
    Fördermenge der Pumpe: 
    Betriebsdruck: 
    Schwenkmoment: 
    Schwenkbereich: 
    Max. Reichweite: 
    Max. hydraulische Reichweite: 
    Max. Hubkraft: 
    Max. Hubmoment:
    Gewicht +V ohne 2/4
    Krangewicht (R3X,STZS): 
    Technische Daten 
    Konstruktionsänderungen vorbehalten, fertigungstechn. Toleranzen müssen berücksichtigt werden. 
    Oil tank
    Excess weight: 
    All weights given without assembly accessory,additional devices and oil. 
    Hydr. boom extensions:
    Max. outreach + Fly-Jib: 
    Max. outreach: 
    Pump capacity: 
    Operating pressure:
    Slewing torque: 
    Slewing angle: 
    Max. outreach: 
    Max. hydraulic outreach: 
    Max. lifting capacity: 
    Lifting moment:
    Weight +V without 2/4
    Crane weight (R3X,STZS): 
    Specifications 
    Subject to change, production tolerances have to be taken into account. 
    Réservoir
    Excessif poids: 
    Tous les poids sans huile ni accessoire de montage ni appareils accessoires 
    Extensions hydrauliques:
    Portee maximale + Fly-Jib: 
    Max. portee: 
    Debit de pompe: 
    Pression d' utilisation:
    Couple de rotation: 
    Angle de rotation: 
    Max. portee: 
    Portee hydraulique maximale: 
    Capacite maxi de levage:
    Couple de levage:
    Poids +V sans 2/4
    Poids grue (R3X,STZS): 
    Données Techniques 
    Sous reserve de modifications de conception. Les tolerances relatives a la technique de production doivent etre prises en consideration.
    

    As you can see, "Lifting moment" and all the other missing labels are there now.