
How to order text extracted from a PDF?


I'm building a PDF parser that extracts text and saves it to a txt file.
I do that by tracing all content objects and then decoding the streams using the font encoding. What I find a bit challenging is how to put the text in the right order. I don't care how it really looks; all I want is the order of the sequences. I don't care about font size, spacing between text, etc.

So how can I deal with Tm, Td, TD and T* if all I care about is the order?

Another question: sometimes one content object contains streams that are from 2 different pages. How can I know where the streams of the next page start?


Solution

  • As your question is very generic, this answer also is generic.

    So how can I deal with Tm, Td, TD and T* if all I care about is the order?

    There are only a few main options:

    • You may ignore everything except the text-showing operators (Tj, TJ, ' and "); in particular, you ignore the operators you mention. A considerable number of documents allow fairly faithful text extraction that way because their text-showing operators occur in natural reading order.

      That you ask this question at all, though, seems to indicate that you have encountered documents that are built differently.

    • If the documents in question are appropriately tagged, you can use the MCIDs in the content stream in combination with the document structure tree to sort (and categorize!) the text pieces you extract as before. For tagged documents this usually gives a reasonably good text extraction result.

    • Otherwise, even if you "only" care about the order, you have to extract the exact positions of the text pieces together with those pieces themselves and eventually sort them accordingly. And this means not only considering the operations you mention but also cm (changes the current transformation matrix), q (saving the current graphics state), and Q (restoring the current graphics state).

      Furthermore, for a solution that works for a very wide range of documents, you also have to analyze the layout to recognize multiple columns and text that belongs to figures rather than to the main text flow.

    Of course there may be some variations of those main options, for example assuming the order of text objects to be correct and only sorting inside them. Such variations may allow you to ignore some operators; in the example just mentioned you may ignore cm, q, and Q, as they are not allowed inside text objects.
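
    The third option can be sketched as follows. This is only a minimal illustration under several assumptions: the content stream has already been tokenized into (operands, operator) pairs, the strings are already decoded, and the page has a single column. The helper names `mul`, `text_positions` and `reading_order` are hypothetical. The sketch does not advance the text matrix after showing text (that would need glyph widths) and it skips the ' and " operators, so pieces shown without an intervening positioning operator share one position and keep their stream order.

```python
IDENTITY = (1.0, 0.0, 0.0, 1.0, 0.0, 0.0)


def mul(m, n):
    """Return m x n for PDF matrices given as (a, b, c, d, e, f)."""
    a, b, c, d, e, f = m
    a2, b2, c2, d2, e2, f2 = n
    return (a * a2 + b * c2, a * b2 + b * d2,
            c * a2 + d * c2, c * b2 + d * d2,
            e * a2 + f * c2 + e2, e * b2 + f * d2 + f2)


def text_positions(ops):
    """Collect (x, y, text) for every Tj/TJ in a list of (operands, operator)."""
    ctm, leading = IDENTITY, 0.0
    saved = []                      # graphics state stack for q / Q
    tm = tlm = IDENTITY             # text matrix and text line matrix
    found = []
    for operands, op in ops:
        if op == "q":
            saved.append((ctm, leading))
        elif op == "Q" and saved:
            ctm, leading = saved.pop()
        elif op == "cm":            # new CTM = operand matrix x old CTM
            ctm = mul(tuple(operands), ctm)
        elif op == "BT":
            tm = tlm = IDENTITY
        elif op == "Tm":
            tm = tlm = tuple(operands)
        elif op == "TL":
            leading = operands[0]
        elif op in ("Td", "TD"):
            tx, ty = operands
            if op == "TD":          # TD also sets the leading to -ty
                leading = -ty
            tm = tlm = mul((1, 0, 0, 1, tx, ty), tlm)
        elif op == "T*":            # same as "0 -leading Td"
            tm = tlm = mul((1, 0, 0, 1, 0, -leading), tlm)
        elif op in ("Tj", "TJ"):
            # Origin of text space mapped to the initial user space.
            x, y = mul(tm, ctm)[4:]
            found.append((x, y, operands[0]))
    return found


def reading_order(found):
    """Sort top to bottom, then left to right (single-column assumption)."""
    return [text for _, _, text in sorted(found, key=lambda p: (-p[1], p[0]))]


# Made-up operator sequence: the piece drawn first ends up lowest on the page.
ops = [
    ([], "q"),
    ([1, 0, 0, 1, 0, -100], "cm"),
    ([], "BT"), ([1, 0, 0, 1, 50, 300], "Tm"), (["bottom"], "Tj"), ([], "ET"),
    ([], "Q"),
    ([], "BT"), ([1, 0, 0, 1, 50, 700], "Tm"), (["top"], "Tj"),
    ([0, -20], "TD"), (["middle"], "Tj"),
    ([], "T*"), (["lower"], "Tj"), ([], "ET"),
]
print(reading_order(text_positions(ops)))
```

    A real extractor would also compare y values with a tolerance instead of exactly, since text on one visual line often has slightly different baselines.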

    Another question: sometimes one content object contains streams that are from 2 different pages. How can I know where the streams of the next page start?

    I'm not sure I understand you correctly. Do you mean that the same content stream is referenced from multiple page objects which show only certain, probably distinct parts of the content? In that case you have to extract the exact position of the text pieces, too, and check whether they are inside or outside the crop box of the respective page.

    EDIT

    In comments you meanwhile clarified and supplied example data. Analyzing that data gives rise to the following addition:

    First of all, not everything you marked blue in the content stream of page 16 can be found on page 17 instead of page 16; for example very early (in the second line of your "closer look at object 127") you can see the instruction drawing the page number, (16)Tj, which is to be found on page 16, not 17.

    But indeed, soon a large section of text objects starts for which you find the text in a viewer only on page 17, not 16. The reason for this is that those text objects draw outside the visible page area and outside the current clip path!

    In more detail:

    Both page 16 and 17 have a CropBox of [ 0.0 0.0 481.89 680.315 ], i.e. the visible area is the part of the canvas with (0, 0) in the bottom left corner and (481.89, 680.315) in the top right one.

    Then you find the following instructions early in the part of the page 16 content you marked in blue:

    0 0 481.89 680.315 re
    W n
    

    This intersects the current clip path with the rectangle with (0, 0) in the bottom left corner and (481.89, 680.315) in the top right one.

    Thus, everything drawn outside that box on page 16 is invisible for two reasons: it lies outside the CropBox, and it lies outside the clip path.

    But the following text objects in that content marked in blue draw text at positions with negative x coordinates! For example immediately after the clip path changing instructions above:

    BT
    0 0 0 1 k
    /GS1 gs
    /C2_0 1 Tf
    15 0 0 15 -441.875 556.9449 Tm
    [<0003007000640059006E005F004300D0>-48<00030044004C>-47<000300CA006E>1<0066003D00ED>-48<000300BA007B0065005C003D>-48<000300700059006E005F005500D0>-47<0003007000650063009D004300D0>-48<0003007F006800FD00DA>-48<00030007000C000C000C0008>-45<0003006E006900CC>-47<000300EF007B00640052>-49<000300BA007B005F003D00ED>-48<000300EC007B004100ED>-48<00030101>-47<0003007B0065002200D0>]TJ
    ET
    EMC 
    /Span <</MCID 243 >>BDC 
    BT
    /C2_0 1 Tf
    15 0 0 15 -441.875 534.9449 Tm
    [<0003008B0053007D003D>-289<0003007000650063009D0043006E003D>-289<000300D2007B00680062004300D0>-291<000300E50077>]TJ
    ET
    EMC 
    

    The fifth operand of each of those text matrix setting instructions (Tm) is essentially the x coordinate from which the following text drawing instruction will draw text left-to-right. The value of -441.875 there is clearly outside the box mentioned above.

    If you look at the content stream of the following page 17, you'll find similar instructions for drawing the same text:

    BT
    /C2_0 1 Tf
    15 0 0 15 40.0148 556.9449 Tm
    [<0003007000640059006E005F004300D0>-48<00030044004C>-47<000300CA006E>1<0066003D00ED>-48<000300BA007B0065005C003D>-48<000300700059006E005F005500D0>-47<0003007000650063009D004300D0>-48<0003007F006800FD00DA>-48<00030007000C000C000C0008>-45<0003006E006900CC>-47<000300EF007B00640052>-49<000300BA007B005F003D00ED>-48<000300EC007B004100ED>-48<00030101>-47<0003007B0065002200D0>]TJ
    ET
    EMC 
    /Artifact <</O /Layout >>BDC 
    BT
    /C2_0 1 Tf
    15 0 0 15 40.0148 534.9449 Tm
    [<0003008B0053007D003D>-289<0003007000650063009D0043006E003D>-289<000300D2007B00680062004300D0>-291<000300E50077>]TJ
    ET
    EMC 
    

    In contrast to the instructions on page 16, though, the x coordinates here are 40.0148 which clearly is inside the CropBox of page 17.
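
    The comparison can be expressed as a small check. This is a rough sketch that uses the CropBox and the Tm operands quoted above, and it assumes the current transformation matrix is still the identity. `inside_box` is a hypothetical helper name. It only tests the start point of the text, so a long line that starts outside the box but runs into it would need the glyph widths as well.

```python
def inside_box(x, y, box):
    """True if the point (x, y) lies within box = (llx, lly, urx, ury)."""
    llx, lly, urx, ury = box
    return llx <= x <= urx and lly <= y <= ury


CROP_BOX = (0.0, 0.0, 481.89, 680.315)

# Tm operands of the same text line in the two content streams.
tm_by_page = {
    16: (15, 0, 0, 15, -441.875, 556.9449),
    17: (15, 0, 0, 15, 40.0148, 556.9449),
}

for page, tm in tm_by_page.items():
    x, y = tm[4], tm[5]             # translation part = text start point
    print(page, inside_box(x, y, CROP_BOX))
```

    With these values the check fails for the page 16 stream and succeeds for the page 17 stream, so an extractor could drop the text pieces whose start point falls outside the CropBox.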


    Beware: above I said that the fifth operand of each of those text matrix setting instructions (Tm) is essentially the x coordinate from which the following text drawing instruction will draw text left-to-right. The coordinates meant here are the current user space coordinates.

    If there had been cm instructions before, the current user space coordinates would not necessarily be the default user space coordinates in which the crop box is defined.

    In the case of your document, though, no cm instruction is used before the instructions discussed above.
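
    For documents that do use cm, the text matrix has to be combined with the current transformation matrix before comparing against the crop box. The following is a minimal sketch of that step. The names `mul` and `to_default_user_space` are hypothetical, and the translation by 500 is an invented value purely for illustration; it does not occur in your document.

```python
def mul(m, n):
    """Return m x n for PDF matrices given as (a, b, c, d, e, f)."""
    a, b, c, d, e, f = m
    a2, b2, c2, d2, e2, f2 = n
    return (a * a2 + b * c2, a * b2 + b * d2,
            c * a2 + d * c2, c * b2 + d * d2,
            e * a2 + f * c2 + e2, e * b2 + f * d2 + f2)


def to_default_user_space(tm, ctm):
    """Map the text start point given by tm through the CTM."""
    product = mul(tm, ctm)
    return product[4], product[5]


tm = (15, 0, 0, 15, -441.875, 556.9449)

# No cm before the text object: the CTM is the identity.
identity = (1, 0, 0, 1, 0, 0)
print(to_default_user_space(tm, identity))

# Invented case: as if "1 0 0 1 500 0 cm" had been executed earlier.
shifted = (1, 0, 0, 1, 500.0, 0.0)
print(to_default_user_space(tm, shifted))
```

    In the invented case the same Tm operands would place the text at x = 58.125, i.e. inside the crop box, which is why the raw Tm values alone are not enough once cm is involved.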