Can the stream encoding change depending on the document version?


Hello StackOverflow community, I have an interesting question about streams in PDF files. I have 5 PDFs.

When I decode a PDF page content stream, I split it into text blocks and then want to convert each block into a regular string. (My task is not just to get the text out of the PDF; it requires that I parse this data stream and get the text from it.)

And this is what I am getting:

1 document TEXT: Cyrillic alphabet

b'BT 11 0 0 11 0 0 Tm\n/TT2 1 Tf (!"#$%&\'\\(\\)*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\\\]^_ab)\nTj ET'

2 document TEXT: This is a small demonstration .pdf file -

b'BT\r\n/F1 0010 Tf\r\n69.2500 688.6080 Td\r\n( This is a small demonstration .pdf file - ) Tj\r\nET'

3 document TEXT: overall number of doses in the series

b'BT 2.400000 213.686918 Td [(\x00o\x00v\x00e\x00r\x00a\x00l\x00l\x00 \x00n\x00u\x00m\x00b\x00e\x00r\x00 \x00o\x00f\x00 \x00d\x00o\x00s\x00e\x00s\x00 \x00i\x00n\x00 \x00t\x00h\x00e\x00 \x00s\x00e\x00r\x00i\x00e\x00s)] TJ ET'

4 document TEXT: Date of birth in Romanian , Russian and English Languages

b'BT 27.111811 391.343714 Td [(\x00D\x00a\x00t\x00a\x00 \x00n\x00a\x02\x19\x00t\x00e\x00r\x00e\x00 \x00|\x00 \x04\x14\x040\x04B\x040\x00 \x04@\x04>\x046\x044\x045\x04=\x048\x04O\x00 \x00|\x00 \x00D\x00a\x00t\x00e\x00 \x00o\x00f\x00 \x00b\x00i\x00r\x00t\x00h\x00:)] TJ ET'

5 document TEXT: Example text

b'BT\n0 Tr\n/F1 79.848503 Tf\n1 0 0.000000 -1 196.000000 874.080017 Tm\n[<0028>-0.839844<005B><0044>-0.847656<0050>-0.832031<0053><004F>-0.832031<0048>-0.847656<0003><0057>-0.832031<0048>-0.847656<005B><0057>-0.832031] TJ\nET'

I can read the first 2 documents, but I don't know which decoding method they use.

I know how to read documents 3 and 4, because I know they use Unicode characters (but I noticed that this doesn't work well with the Cyrillic alphabet).

I don't know how to work with the 5th document, and I don't understand how to decode this type of encoding.

I would welcome any answer, any explanation and any advice.

Thank you.


Solution

  • I am going to use your most challenging example, number 5, to show a basic way a PDF font look-up can be used to untangle some encodings. It is a relatively simple and common mix.

    Your 5th example text is a very common CIDFont+F1 encoding with 16-bit codes. First remove the kerning: the numbers between the <16-bit letters> are small gap shifts. For example, >-0.839844< widens the gap by a small amount, moving the next glyph rightwards by about 0.8 sub-units (thousandths of a text-space unit, not points). So the array that starts [<0028>-0.839844<005B><0044>

    produces
    [<0028><005B><0044><0050><0053><004F><0048><0003><0057><0048><005B><0057>]
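
    The kerning removal above can be sketched in Python. This is only a minimal sketch: the names tj_array and codes are mine, and it assumes every code is 2 bytes wide, which happens to hold for this font but is not true of every PDF.

```python
import re

# TJ array copied from the question's fifth content stream.
tj_array = (
    "[<0028>-0.839844<005B><0044>-0.847656<0050>-0.832031<0053><004F>"
    "-0.832031<0048>-0.847656<0003><0057>-0.832031<0048>-0.847656<005B>"
    "<0057>-0.832031]"
)

# Keep only the <hex> strings; the numbers between them are kerning shifts.
hex_strings = re.findall(r"<([0-9A-Fa-f]+)>", tj_array)

# One hex string may hold several codes, so cut each into 4-digit chunks
# (assumption: 2-byte codes, as in this example).
codes = [h[i:i + 4].upper() for h in hex_strings for i in range(0, len(h), 4)]

print("[" + "".join(f"<{c}>" for c in codes) + "]")
```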

    Below is a very common CMap for text encoded this way. I don't have the one from your file, so I am using the CMap from another "Print to PDF" file. Its font subset did not use the code <0028>, so ignore the first entry for now (it is missing from this table).

    Using this common look-up table we can translate the codes into ASCII/ANSI characters:

    stream
    /CIDInit /ProcSet findresource begin
    12 dict begin
    begincmap
    /CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def
    /CMapName /Adobe-Identity-UCS def
    /CMapType 2 def
    1 begincodespacerange
    <0000> <FFFF>
    endcodespacerange
    33 beginbfchar
    <0003> <0020>
    <000F> <002C>
    <0011> <002E>
    <0013> <0030>
    <0014> <0031>
    <0015> <0032>
    <0016> <0033>
    <0017> <0034>
    <0018> <0035>
    <0019> <0036>
    <001A> <0037>
    <001B> <0038>
    <001C> <0039>
    <0020> <003D>
    <0037> <0054>
    <003B> <0058>
    <0044> <0061>
    <0045> <0062>
    <0047> <0064>
    <0048> <0065>
    <004A> <0067>
    <004B> <0068>
    <004C> <0069>
    <004F> <006C>
    <0050> <006D>
    <0051> <006E>
    <0052> <006F>
    <0053> <0070>
    <0055> <0072>
    <0056> <0073>
    <0057> <0074>
    <0059> <0076>
    <005B> <0078>
    endbfchar
    endcmap
    CMapName currentdict /CMap defineresource pop
    end
    end
    endstream
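
    If you want to build the look-up table programmatically, something like the following may do. It is a rough sketch, not a full CMap parser: parse_bfchar is a name I made up, cmap_text is an abbreviated copy of the stream above (5 of the 33 pairs), and beginbfrange sections, which real files often contain, are not handled.

```python
import re

# Abbreviated copy of the CMap stream above (5 of its 33 pairs, for brevity).
cmap_text = """
1 begincodespacerange <0000> <FFFF> endcodespacerange
5 beginbfchar
<0003> <0020> <0044> <0061> <0048> <0065> <0057> <0074> <005B> <0078>
endbfchar
"""

def parse_bfchar(text):
    """Return {source code: Unicode string} from every bfchar section."""
    table = {}
    for section in re.findall(r"beginbfchar(.*?)endbfchar", text, re.S):
        pairs = re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", section)
        for src, dst in pairs:
            # ToUnicode destinations are UTF-16BE encoded.
            table[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return table

table = parse_bfchar(cmap_text)
print(table)
```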
    

    So the above is a CMap of pairs: we find the first value of a pair and convert it to the second.

    <005B> <0078> x
    <0044> <0061> a
    <0050> <006D> m
    <0053> <0070> p
    <004F> <006C> l
    <0048> <0065> e
    <0003> <0020> (space)
    <0057> <0074> t
    <0048> <0065> e
    <005B> <0078> x
    <0057> <0074> t
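
    The same look-up can be written as a few lines of Python. A minimal sketch: to_unicode holds only the pairs this string needs, and marking unmapped codes with U+FFFD is my own choice so that gaps in the table stay visible.

```python
# Subset of the bfchar table above: glyph code -> Unicode code point.
to_unicode = {
    0x0003: 0x0020, 0x0044: 0x0061, 0x0048: 0x0065, 0x004F: 0x006C,
    0x0050: 0x006D, 0x0053: 0x0070, 0x0057: 0x0074, 0x005B: 0x0078,
}

# The 16-bit codes from the TJ array, with the kerning already removed.
codes = [0x0028, 0x005B, 0x0044, 0x0050, 0x0053, 0x004F,
         0x0048, 0x0003, 0x0057, 0x0048, 0x005B, 0x0057]

# Codes missing from the table become U+FFFD instead of raising an error.
text = "".join(chr(to_unicode.get(code, 0xFFFD)) for code in codes)
print(text)
```

    Only the first code, <0028>, has no entry in this borrowed table; the rest decodes to "xample text".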

    "xample text" is clearly correct, so what is <0028>?

    Well, in the table above it is 8 more than the entry <0020> <003D>, so it should map to <0045>, which is E,

    and blow me over sideways, it works too (but not always :-)
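
    That guess can be checked mechanically: every pair in the table above differs by the same amount, 0x1D. Here is a hedged sketch, where known is a handful of those pairs and guess is a hypothetical helper. It only extrapolates when all known pairs agree on one offset, and even then it is a heuristic that will fail for fonts whose glyph codes do not follow Unicode order.

```python
# A few pairs from the bfchar table above (glyph code -> Unicode code point).
known = {0x0003: 0x0020, 0x0020: 0x003D, 0x0044: 0x0061, 0x005B: 0x0078}

# Collect the distinct differences between destination and source.
offsets = {dst - src for src, dst in known.items()}

def guess(code):
    """Extrapolate a missing code, or return None if the offsets disagree."""
    if len(offsets) != 1:
        return None
    return chr(code + next(iter(offsets)))

print(hex(next(iter(offsets))), guess(0x0028))
```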

    Overall it is an ex-streamly (pun :-) inefficient use of text, as it would have been far simpler as /F# Tf (Example text) Tj and no less readable by humans. All that kerning, plus 16 bits where 8 would do, plus the >precision< overheads, just adds to the MegaFlops of carbon needed to read an accessible, "pretty", poorly justified Anglo-American text PDF.

    HTML is far more efficient for wasting bytes.