
Build a PDF file manually from scratch and embed images


I'm trying to generate a PDF file programmatically.

The full picture: I receive multi-page PDFs. Each page is an image containing the content I want. I don't want to use external libraries because I'm looking for performance and optimization (in the long run it will matter to me). I used to have something working (I built a header / file content (image) / footer scheme), and it always worked. However, something has changed and it stopped working.

To fix it and rebuild from scratch, here are the steps I followed:

  1. Extracted the FlateDecode portion related to the image file (one of many).
  2. Created a clean JPEG from it (no Photoshop headers etc., just a simple JPEG file).
  3. Submitted the file to an online PDF conversion service, which created a PDF file from this JPEG.
  4. Worked out how that PDF file was built, including the image part. Coded everything manually, including the references in the xref table.
  5. All I get is "The file is damaged". I've compared both files (the converted one and the one I made), and they seem to be almost identical (the size differs because of the image portion).

I don't know what else to do, since the two files seem to be almost exactly the same. I've also decoded one of the FlateDecode streams inside the PDF file, but I couldn't find anything related to object positioning inside the file.

Here's the code I'm using:

using (var b = new BinaryWriter(File.Open(@"C:\test\Rio\Reboot\fullmanual01.pdf", FileMode.Create)))
{
    var imgBytes = File.ReadAllBytes(@"C:\test\Rio\Reboot\decompressedimg.raw");
    var firstFlate = File.ReadAllBytes(@"C:\test\Rio\Reboot\flateStr01.raw");
    var FlateDecompressed = Encoding.ASCII.GetString(FlateDecompress(firstFlate));
    string crlf = Environment.NewLine;

    var pdfHeader = Encoding.ASCII.GetBytes($"%PDF-1.4{crlf}");
    b.Write(pdfHeader);
    pdfHeader = StringToByteArray("25E2E3CFD30D0A");
    b.Write(pdfHeader);
    var pdfObj = new PDFStrObject(1, $"/Type /Page{crlf}/MediaBox [ 0 0 595 769 ]{crlf}/Resources << /XObject << /X0 3 0 R >> >>{crlf}/Contents 4 0{crlf}/Parent 2 0 R{crlf}/Rotate 360{crlf}>>{crlf}endobj{crlf}").byteFromStrObj;
    b.Write(pdfObj);
    var secondObjPos = b.BaseStream.Position.ToString("0000000000");
    pdfObj = new PDFStrObject(3, $"/Type /XObject{crlf}/Subtype /Image{crlf}/Width 1016{crlf}/Height 1328{crlf}/BitsPerComponent 8{crlf}/ColorSpace /DeviceGray{crlf}/Filter /FlateDecode{crlf}/Length {imgBytes.Length}{crlf}>>{crlf}stream{crlf}").byteFromStrObj;
    b.Write(pdfObj);
    b.Write(imgBytes);
    b.Write(Encoding.ASCII.GetBytes($"{crlf}endstream{crlf}endobj{crlf}"));
    var thirdObjPos = b.BaseStream.Position.ToString("0000000000");
    pdfObj = new PDFStrObject(4, $"/Filter /FlateDecode{crlf}/Length 45{crlf}>>{crlf}stream{crlf}").byteFromStrObj;
    b.Write(pdfObj);
    b.Write(firstFlate);
    b.Write(Encoding.ASCII.GetBytes($"{crlf}endstream{crlf}endobj{crlf}"));
    var secondPos = b.BaseStream.Position;
    pdfObj = new PDFStrObject(2, $"/Type /Pages{crlf}/Kids [ 1 0 R ]{crlf}/Count 1{crlf}>>{crlf}endobj{crlf}").byteFromStrObj;
    b.Write(pdfObj);
    var firstObjPos = b.BaseStream.Position.ToString("0000000000"); //2 0 obj
    pdfObj = new PDFStrObject(5, $"/Type /Catalog{crlf}/Pages 2 0{crlf}>>{crlf}endobj{crlf}").byteFromStrObj;
    b.Write(pdfObj);
    var fourthObhPos = b.BaseStream.Position.ToString("0000000000");
    b.Write(Encoding.ASCII.GetBytes($"xref{crlf}0 6{crlf}"));
    b.Write(Encoding.ASCII.GetBytes($"0000000000 65535 f{crlf}0000000017 00000 n{crlf}"));

    b.Write(Encoding.ASCII.GetBytes($"{firstObjPos} 00000 n{crlf}"));

    b.Write(Encoding.ASCII.GetBytes($"{secondObjPos} 00000 n{crlf}"));

    b.Write(Encoding.ASCII.GetBytes($"{thirdObjPos} 00000 n{crlf}"));
    b.Write(Encoding.ASCII.GetBytes($"{fourthObhPos} 00000 n{crlf}"));
    b.Write(Encoding.ASCII.GetBytes($"trailer{crlf}<<{crlf}/Size 6{crlf}/Root 5 0{crlf}/ID [<05bebfaf5c6382cfbc44cd1b3389e097><05bebfaf5c6382cfbc44cd1b3389e097>]{crlf}>>{crlf}startxref{crlf}{b.BaseStream.Position+7}{crlf}%%EOF{crlf}"));
}

And the class for building objects:

class PDFStrObject
{
    public string strObj { get; private set; }
    public byte[] byteFromStrObj { get; private set; }
    public PDFStrObject(int objNum, string content)
    {
        string crlf = Environment.NewLine;

        strObj =  $"{objNum} 0 obj{crlf}<<{crlf}{content}";
        byteFromStrObj = Encoding.ASCII.GetBytes(strObj);
    }
}
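
The listing above calls two helpers that are not shown, StringToByteArray and FlateDecompress. The following is only a guess at what they might look like, assuming .NET 6 or later (for Convert.FromHexString and ZLibStream) and assuming the extracted stream carries the usual zlib wrapper that PDF FlateDecode data has. The class name PdfHelpers is hypothetical.

```csharp
using System;
using System.IO;
using System.IO.Compression;

// Hypothetical stand-ins for the helpers the question uses but does not show.
static class PdfHelpers
{
    // Converts hex text such as "25E2E3CFD30D0A" into the bytes it spells out.
    public static byte[] StringToByteArray(string hex) => Convert.FromHexString(hex);

    // Inflates zlib-wrapped data (the format used by /FlateDecode streams).
    public static byte[] FlateDecompress(byte[] data)
    {
        using var input = new MemoryStream(data);
        using var zlib = new ZLibStream(input, CompressionMode.Decompress);
        using var output = new MemoryStream();
        zlib.CopyTo(output);
        return output.ToArray();
    }
}
```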

The files I've been using are here: https://drive.google.com/drive/folders/11HN9cB9Cs7uqBQdpZkNyNKt29sl_xJrL?usp=sharing

Their descriptions:

decompressedimg-convertido.pdf -> The file I converted online.

decompressedimg.raw -> The image portion I extracted from the multi-page PDF. Dimensions are W: 1016, H: 1328.

fullmanual01.pdf -> The file I generated using my code.

PDfRjMultiplePages -> The multi-page PDF file I want to programmatically extract pages from.

Any input is appreciated. I've also looked at the question "Issue writing a PDF file from scratch", but unfortunately couldn't find a hint for what I'm trying to do.

Thanks


Solution

  • The first thing that stands out is that your startxref is pointing to the wrong spot.

    In the screenshot (not reproduced here), it points to the spot marked in red, but it should point to the spot marked in blue.

    The other obvious issue is that you have an earlier xref table in the middle of the file. So you attempted (perhaps inadvertently) to create either a Linearized or an Incremental PDF file. Based on your description, there is no point in doing either of those. You should just stick to basic PDF, with one xref table at the end of the file.

    You should take a closer look at the post you referenced; it seems like a good starting point.

    The PDF 1.7 spec also provides very simple "hello world" examples.

    There may very well be other issues. You may want to reconsider using a third-party library to create your PDF files.
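
As a minimal sketch of the "one xref table at the end" layout (not the asker's code, and not a complete PDF writer): record the stream position immediately before each object is written, emit a single xref table from those recorded offsets, and write the position of the xref keyword itself after startxref. The names MinimalPdf, Build and BuildSample are hypothetical, the 1x1 hex-encoded gray image is a placeholder, and the sketch assumes the catalog is always object 1. A real image stream would have to be written as raw bytes rather than as an ASCII string.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper: writes numbered objects and remembers where each starts.
static class MinimalPdf
{
    // Fixed two-byte line ending, so every xref entry is exactly 20 bytes long.
    const string Eol = "\r\n";

    public static byte[] Build(IList<string> objectBodies)
    {
        using var ms = new MemoryStream();
        var offsets = new long[objectBodies.Count + 1];

        Write(ms, "%PDF-1.4" + Eol);

        for (int i = 0; i < objectBodies.Count; i++)
        {
            int num = i + 1;
            offsets[num] = ms.Position; // record the offset BEFORE writing the object
            Write(ms, $"{num} 0 obj{Eol}{objectBodies[i]}{Eol}endobj{Eol}");
        }

        long xrefPos = ms.Position; // byte offset of the "xref" keyword
        Write(ms, $"xref{Eol}0 {objectBodies.Count + 1}{Eol}");
        Write(ms, $"0000000000 65535 f{Eol}");
        for (int num = 1; num <= objectBodies.Count; num++)
            Write(ms, $"{offsets[num]:D10} 00000 n{Eol}");

        // Assumption: the first body passed in is the catalog, hence /Root 1 0 R.
        Write(ms, $"trailer{Eol}<< /Size {objectBodies.Count + 1} /Root 1 0 R >>{Eol}");
        Write(ms, $"startxref{Eol}{xrefPos}{Eol}%%EOF{Eol}");
        return ms.ToArray();
    }

    // One page that paints a single gray pixel stretched over the whole MediaBox.
    public static byte[] BuildSample()
    {
        string content = "q 595 0 0 769 0 0 cm /X0 Do Q";
        string pixels = "80>"; // one gray sample in hex; '>' ends ASCIIHexDecode data
        var bodies = new List<string>
        {
            "<< /Type /Catalog /Pages 2 0 R >>",
            "<< /Type /Pages /Kids [ 3 0 R ] /Count 1 >>",
            "<< /Type /Page /Parent 2 0 R /MediaBox [ 0 0 595 769 ] " +
                "/Resources << /XObject << /X0 5 0 R >> >> /Contents 4 0 R >>",
            $"<< /Length {content.Length} >>{Eol}stream{Eol}{content}{Eol}endstream",
            "<< /Type /XObject /Subtype /Image /Width 1 /Height 1 " +
                "/BitsPerComponent 8 /ColorSpace /DeviceGray /Filter /ASCIIHexDecode " +
                $"/Length {pixels.Length} >>{Eol}stream{Eol}{pixels}{Eol}endstream",
        };
        return Build(bodies);
    }

    static void Write(Stream s, string text)
    {
        byte[] bytes = Encoding.ASCII.GetBytes(text);
        s.Write(bytes, 0, bytes.Length);
    }
}
```

To try it, call File.WriteAllBytes("minimal.pdf", MinimalPdf.BuildSample()) and open the result in a viewer. Because every offset is captured from the stream rather than computed by hand, the xref entries and the startxref value stay correct when object sizes change. Note also that every indirect reference in the sample ends in R (for example /Contents 4 0 R), which the PDF syntax requires.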