pdf pdf-generation adobe offset postscript

Problems calculating byte offsets in a PDF file

I'm trying to understand the PDF structure right now, but I have a small problem calculating the byte offsets of the objects. As I understand it, the offset of an object is counted from the beginning of the file to the start of the object header (for example 6 0 obj).

I have a working hello world PDF file, but when I count the offsets myself I get different values than the ones in the xref table.

If anybody understands how this is counted please let me know!

Example:

6 0 obj xref: 9 me: 17

1 0 obj xref: 60 me: 72

4 0 obj xref: 145 me: 187

(I count each line break as "\r\n", i.e. 2 bytes.)

Adobe standard: http://wwwimages.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/pdf_reference_archives/PDFReference.pdf

%PDF-1.4
%%EOF
6 0 obj
<<
/Type /Catalog
/Pages 5 0 R
>>
endobj
1 0 obj
<<
/Type /Page
/Parent 5 0 R
/MediaBox [ 0 0 612 792 ]
/Resources 3 0 R
/Contents 2 0 R
>>
endobj
4 0 obj
<<
/Type /Font
/Subtype /Type1
/Name /F1
/BaseFont/Helvetica
>>
endobj
2 0 obj
<<
/Length 53
>>
stream
BT
/F1 24 Tf
1 0 0 1 260 600 Tm
(Hello World)Tj
ET
endstream
endobj
5 0 obj
<<
/Type /Pages
/Kids [ 1 0 R ]
/Count 1
>>
endobj
3 0 obj
<<
/ProcSet[/PDF/Text]
/Font <</F1 4 0 R >>
>>
endobj
xref
0 7
0000000000 65535 f
0000000060 00000 n
0000000228 00000 n
0000000424 00000 n
0000000145 00000 n
0000000333 00000 n
0000000009 00000 n
trailer
<<
/Size 7
/Root 6 0 R
>>
startxref
488
%%EOF

Solution

This is a very interesting file, and reading the PDF specification initially just confused me more :-). In such cases (I'll madden some people with this) I would simply save the example PDF file and do as @KenS suggests in his previous answer: open it in Acrobat. If Acrobat reports the file as damaged, or asks you to save when you close it, Acrobat doesn't like it and you can assume you've gotten something wrong.

The reason this file is interesting is the second line, the:

%%EOF

I don't agree with KenS that having this line automatically invalidates the file; I can find no text in ISO 32000 that states this. The text says that the %%EOF line at the end of a file has syntactical meaning (and explains why it is there), and it states that any line beginning with a percent sign (%) is a comment and what that means. But nowhere does it state that %%EOF is not allowed as a comment somewhere else in the file (I consider it a dumb thing to do, but that is a different matter).

If that %%EOF line isn't there, the XREF table is correct. If it is there, it's wrong. Some more explanation of what I read in the documentation:

1) As far as I understand, the offset starts at the first byte of the file (it's a byte offset, not a character offset), which is offset 0, and counts up from there. The idea behind this is that you can open a file, set the file read position to a given offset and start reading. So if you open the file in a binary editor that shows the real bytes, the offset should match what you see there. If your %%EOF line isn't there, the first object (6 0 obj) begins at offset 9 (assuming the line ending is a single byte). This matches the example given in the PDF specification itself, so I'm confident that an offset of 9 is correct provided that the second line (%%EOF) is not in the PDF file.

2) That second line starts with a percent sign, which makes it a comment. The PDF specification states that a comment (everything from the % sign up to but not including the end-of-line marker) shall be treated as a single whitespace character. That's interesting and could lead to all kinds of speculation about what it means for the offset of the object following it, but frankly all of that speculation is irrelevant because of what I stated before:

"The idea behind this is that you can open a file, set the file read position to a given offset and start reading."

That's exactly what the cross-reference table is for, and it should be taken literally: a comment still occupies its real bytes in the file. In other words, assuming single-byte line endings, object 6 in your example file starts at offset 15 (9 bytes for the %PDF-1.4 line plus 6 bytes for the %%EOF line), and that's the number that should be in the XREF table for that object.
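To make the "seek to the offset and start reading" idea concrete, here is a minimal Python sketch. The helper name `entry_points_at_object` is made up for illustration, and the sample bytes are only the first lines of the example file, assuming single-byte LF line endings. It checks whether the bytes at a given offset really start with the expected object header.

```python
import io


def entry_points_at_object(data: bytes, offset: int, obj_num: int, gen: int = 0) -> bool:
    """Return True if the bytes at `offset` start with the header 'N G obj'."""
    stream = io.BytesIO(data)
    stream.seek(offset)  # the same thing a PDF reader does with an xref entry
    expected = f"{obj_num} {gen} obj".encode("ascii")
    return stream.read(len(expected)) == expected


# First lines of the example file; LF line endings are an assumption here.
sample = (
    b"%PDF-1.4\n"
    b"%%EOF\n"
    b"6 0 obj\n"
    b"<<\n"
    b"/Type /Catalog\n"
    b"/Pages 5 0 R\n"
    b">>\n"
    b"endobj\n"
)

print(entry_points_at_object(sample, 9, 6))   # offset from the xref table
print(entry_points_at_object(sample, 15, 6))  # offset counted with the extra line
```

With the extra %%EOF line present, the read at offset 9 lands on the comment rather than on the object header, while the read at offset 15 finds "6 0 obj". For a real file you would pass the result of reading it in binary mode instead of `sample`.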

Again, take @KenS' comment into account: you cannot just assume the line ending is two bytes, you have to know what the line endings actually are (and they could be mixed, so you can't even assume all lines use the same one). If this file had two-byte line endings on all lines, your count of 17 would be the correct one.
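If you want to see how much the line endings matter, the following Python sketch rebuilds the first lines of the example file with different line endings and reports where "6 0 obj" ends up. The names `object_offsets` and `build` are hypothetical helpers, not part of any PDF library. The regular expression is a simplification: it only recognises headers that follow an LF (so it handles LF and CRLF, but not CR-only files) and it could be fooled by stream data that happens to look like an object header.

```python
import re

# Matches an object header such as b"6 0 obj" at the start of a line.
OBJ_HEADER = re.compile(rb"^(\d+) (\d+) obj\b", re.MULTILINE)


def object_offsets(data: bytes) -> dict:
    """Map (object number, generation) to the byte offset of its header."""
    return {
        (int(m.group(1)), int(m.group(2))): m.start()
        for m in OBJ_HEADER.finditer(data)
    }


def build(lines: list, eol: bytes) -> bytes:
    """Join lines with the given line ending, terminating the last line too."""
    return b"".join(line + eol for line in lines)


head = [
    b"%PDF-1.4",
    b"%%EOF",
    b"6 0 obj",
    b"<<",
    b"/Type /Catalog",
    b"/Pages 5 0 R",
    b">>",
    b"endobj",
]
without_comment = [line for line in head if line != b"%%EOF"]

print(object_offsets(build(head, b"\n"))[(6, 0)])             # 15
print(object_offsets(build(head, b"\r\n"))[(6, 0)])           # 17
print(object_offsets(build(without_comment, b"\n"))[(6, 0)])  # 9
```

The three results correspond to the three numbers discussed above: 15 with single-byte line endings, 17 with "\r\n", and 9 when the %%EOF comment line is removed.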