c# pdf binaryfiles pdf-scraping

Create single page PDF from multi page PDF WITHOUT external libraries


I saw the following question on SO: Create Multi-Page PDF from other PDFs

But it didn't answer what I need. Consider that I have a PDF with 20 pages. So far so good.

From the same source, I can get a PDF with only one page. This one will be used as my template PDF. What I'm trying to do is replace the page content (the FlateDecode stream, and its length as well) in the template PDF and generate a new single-page PDF.

I got the PDF to work; however, a small logo doesn't display, and Adobe Reader says there is a problem displaying the PDF correctly (Google Chrome and Edge just don't display the logo, with no error message).

I've tried adjusting the xref table at the end of the file (manually changing the values) but got the same results.

Does anyone with some knowledge of PDF have any input?

I'm uploading the template PDF and another PDF that I want to extract data from, in order to create a third PDF (using the template PDF but with the contents of the other PDF). I'm also uploading a PDF I made manually that has the display error (it shows the data but not the JPEG logo).

Everything is here: https://drive.google.com/drive/folders/1tsGIbtbfwuATPQ6a_VPjnxLT4ozzNt0s?usp=sharing

I've been doing everything using HxD (to view the hexadecimal content and copy/paste data).

Thanks in advance

EDIT: I'm adding the code I'm currently using to generate the PDF. It's an invalid PDF even with the xref table correct (with the proper positions). The code is extremely ugly, but for now I just want to make it work (rather than write nice code).

static void Main(string[] args)
    {

        Console.WriteLine("Hello World!");


        var jpegLogo = File.ReadAllBytes(@"C:\test\Ginfes-Reboot\jpegLogo.raw");
        var pdfStream = File.ReadAllBytes(@"C:\test\Ginfes-Reboot\pdfStream.raw");
        using (BinaryWriter b = new BinaryWriter(
        File.Open(@"C:\test\Ginfes-Reboot\newPdf_newmethod.pdf", FileMode.Create)))
        {
            WritePDFAgain(b,jpegLogo,pdfStream);

        }

    }
    private static void WritePDFAgain(BinaryWriter b, byte[] jpegLogo,byte[] pdfStream)
    {
        List<long> offSets = new List<long>();
        string str = "%PDF-1.4" + "\n";
        var byteArr = Encoding.ASCII.GetBytes(str);
        b.Write(byteArr);
        byteArr = StringToByteArray("25E2E3CFD30A");
        b.Write(byteArr);
        offSets.Add(b.BaseStream.Position);//0
        str = "3 0 obj" + "\n" + "<</Type/XObject/ColorSpace/DeviceRGB/Subtype/Image/BitsPerComponent 8/Width 60/Length 3857/Height 60/Filter/DCTDecode>>stream" + "\n";
        b.Write(Encoding.ASCII.GetBytes(str));
        b.Write(jpegLogo);
        b.Write(Encoding.ASCII.GetBytes("\n"));
        b.Write(Encoding.ASCII.GetBytes("endstream" +"\n" + "endobj" + "\n"));
        offSets.Add(b.BaseStream.Position);//1
        str = "4 0 obj" + "\n" + "<</Length " + pdfStream.Length + "/Filter/FlateDecode>>stream" + "\n";
        b.Write(Encoding.ASCII.GetBytes(str));
        b.Write(pdfStream);
        b.Write(Encoding.ASCII.GetBytes("\n"));
        b.Write(Encoding.ASCII.GetBytes("endstream" + "\n" + "endobj" + "\n"));
        offSets.Add(b.BaseStream.Position);//2
        str = "1 0 obj" + "\n" + "<</Group<</Type/Group/CS/DeviceRGB/S/Transparency>>/Parent 5 0 R/Contents 4 0 R/Type/Page/Resources<</XObject<</img0 3 0 R>>/ProcSet [/PDF /Text /ImageB /ImageC /ImageI]/ColorSpace<</CS/DeviceRGB>>/Font<</F1 2 0 R>>>>/MediaBox[0 0 595 936]>>" + "\n";
        b.Write(Encoding.ASCII.GetBytes(str));
        b.Write(Encoding.ASCII.GetBytes("endobj" + "\n"));
        offSets.Add(b.BaseStream.Position);//3
        str = "6 0 obj" + "\n" + "[1 0 R/XYZ 0 814 0]" + "\n";
        b.Write(Encoding.ASCII.GetBytes(str));
        b.Write(Encoding.ASCII.GetBytes("endobj" + "\n"));
        offSets.Add(b.BaseStream.Position);//4
        str = "2 0 obj" + "\n" + "<</BaseFont/Helvetica/Type/Font/Encoding/WinAnsiEncoding/Subtype/Type1>>" + "\n";
        b.Write(Encoding.ASCII.GetBytes(str));
        b.Write(Encoding.ASCII.GetBytes("endobj" + "\n"));
        offSets.Add(b.BaseStream.Position);//5
        str = "5 0 obj" + "\n" + "<</ITXT(2.1.7)/Type/Pages/Count 1/Kids[1 0 R]>>" + "\n";
        b.Write(Encoding.ASCII.GetBytes(str));
        b.Write(Encoding.ASCII.GetBytes("endobj" + "\n"));
        offSets.Add(b.BaseStream.Position);//6
        str = "7 0 obj" + "\n" + "<</Names[(JR_PAGE_ANCHOR_0_1) 6 0 R]>>" + "\n";
        b.Write(Encoding.ASCII.GetBytes(str));
        b.Write(Encoding.ASCII.GetBytes("endobj" + "\n"));
        offSets.Add(b.BaseStream.Position);//7
        str = "8 0 obj" + "\n" + "<</Dests 7 0 R>>" + "\n";
        b.Write(Encoding.ASCII.GetBytes(str));
        b.Write(Encoding.ASCII.GetBytes("endobj" + "\n"));
        offSets.Add(b.BaseStream.Position);//8
        str = "9 0 obj" + "\n" + "<</Names 8 0 R/Type/Catalog/ViewerPreferences<</PrintScaling/AppDefault>>/Pages 5 0 R>>" + "\n";
        b.Write(Encoding.ASCII.GetBytes(str));
        b.Write(Encoding.ASCII.GetBytes("endobj" + "\n"));
        offSets.Add(b.BaseStream.Position);//9
        str = "10 0 obj" + "\n" + @"<</Creator(JasperReports \(nfs_novo\))/Producer(iText 2.1.7 by 1T3XT)/ModDate(D:20191211152903-03'00')/CreationDate(D:20191211152903-03'00')>>" + "\n";
        b.Write(Encoding.ASCII.GetBytes(str));
        b.Write(Encoding.ASCII.GetBytes("endobj" + "\n"));
        b.Write(Encoding.ASCII.GetBytes("xref" + "\n" + "0 11" + "\n"));
        b.Write(Encoding.ASCII.GetBytes("0000000000 65535 f " + "\n"));            
        b.Write(Encoding.ASCII.GetBytes("000000" + offSets.ElementAt(2) + " 00000 f " + "\n"));
        b.Write(Encoding.ASCII.GetBytes("000000" + offSets.ElementAt(4) + " 00000 f " + "\n"));
        b.Write(Encoding.ASCII.GetBytes("00000000"+ offSets.ElementAt(0) + " 00000 f " + "\n"));
        b.Write(Encoding.ASCII.GetBytes("000000" + offSets.ElementAt(1) + " 00000 f " + "\n"));
        b.Write(Encoding.ASCII.GetBytes("000000" + offSets.ElementAt(5) + " 00000 f " + "\n"));
        b.Write(Encoding.ASCII.GetBytes("000000" + offSets.ElementAt(3) + " 00000 f " + "\n"));
        b.Write(Encoding.ASCII.GetBytes("00000" + offSets.ElementAt(6) + " 00000 f " + "\n"));
        b.Write(Encoding.ASCII.GetBytes("00000" + offSets.ElementAt(7) + " 00000 f " + "\n"));
        b.Write(Encoding.ASCII.GetBytes("00000" + offSets.ElementAt(8) + " 00000 f " + "\n"));
        b.Write(Encoding.ASCII.GetBytes("00000" + offSets.ElementAt(9) + " 00000 f " + "\n"));
        b.Write(Encoding.ASCII.GetBytes("trailer" + "\n" + "<</Root 9 0 R/ID [<10a2f7fd162aa44a268ebb6f31cc98c4><c36ebb9dc93cd9a72f229f618092eeb0>]/Info 10 0 R/Size 11>>" + "\n"));
        b.Write(Encoding.ASCII.GetBytes("startxref" + "\n" + (b.BaseStream.Position + 6) + "%%EOF" + "\n"));
    }

Files used: https://drive.google.com/drive/folders/1i3J-yioFvcoiakyc_Wi8ddn9g6Pxy7zd?usp=sharing


Solution

  • You are most of the way there; the only problem with the resulting PDF from your example is that the image resource referenced in pdfStream is named img10, whereas the name you are assigning when you create the resource dictionary is img0.

    Below is some code that will identify the correct referenced resource (using a regular expression on the page content), which you can then use when building the dictionary.

    You need these additional using directives:

    using System.IO.Compression;
    using System.Text.RegularExpressions;
    

    This method decompresses the page content stream and matches the image resource name:

    private static string GetImageResourceName(byte[] pdfStream) {
        using (MemoryStream ms = new MemoryStream(pdfStream)) {                
            ms.Seek(2, SeekOrigin.Begin);   // skip first 2 bytes (zlib header)
    
            using (DeflateStream ds = new DeflateStream(ms, CompressionMode.Decompress)) {
                using (StreamReader sr = new StreamReader(ds)) {
                    string contents = sr.ReadToEnd();
    
                    // The content stream operator that paints the image looks like: /img123 Do
                    return Regex.Match(contents, @"\b(img\d+)\s+Do\b").Groups[1].Value;
                }
            }
        }
    }
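
    The two-byte skip above assumes the stream really does start with a plain zlib header. As a small sketch (the ZlibHeader class and the LooksLikeZlib name are made up for this illustration and are not part of your code), you could check that assumption before seeking past it, using the header rules from RFC 1950:

```csharp
using System;

public static class ZlibHeader
{
    // Returns true when the first two bytes form a valid zlib header
    // (RFC 1950) that announces the deflate method and no preset dictionary.
    public static bool LooksLikeZlib(byte[] data)
    {
        if (data == null || data.Length < 2) return false;

        int cmf = data[0];   // compression method and window size
        int flg = data[1];   // check bits, preset-dictionary flag, level

        bool isDeflate = (cmf & 0x0F) == 8;              // CM = 8 means deflate
        bool checkOk = ((cmf << 8) | flg) % 31 == 0;     // FCHECK rule
        bool noPresetDict = (flg & 0x20) == 0;           // FDICT would add 4 header bytes

        return isDeflate && checkOk && noPresetDict;
    }
}
```

    Calling ZlibHeader.LooksLikeZlib(pdfStream) before GetImageResourceName would let you fail with a clear message if some other input file is not compressed the same way as this one.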
    

    Finally, you only need to change this line in your WritePDFAgain method:

    str = String.Format(
        "1 0 obj\n<</Group<</Type/Group/CS/DeviceRGB/S/Transparency>>" 
        + "/Parent 5 0 R/Contents 4 0 R/Type/Page/Resources<</XObject" 
        + "<</{0} 3 0 R>>/ProcSet [/PDF /Text /ImageB /ImageC " 
        + "/ImageI]/ColorSpace<</CS/DeviceRGB>>/Font<</F1 2 0 R>>>>" 
        + "/MediaBox[0 0 595 936]>>\n", 
        GetImageResourceName(pdfStream)
    );
    

    As per my disclaimer in the comments, this code will only work for this very specific case and input data. It is by no means a general purpose solution, but I think you accept that.

    I will reiterate my point that if you are intent on not using any external libraries for this, then you will likely end up writing your own (albeit a very basic one).
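
    To give an idea of what one small piece of such a basic library might look like, here is a sketch of a cross-reference table writer. It is not needed for the image fix above, and the XrefSketch class and its method names are hypothetical. It relies on the layout the PDF specification gives for xref entries: a 10-digit byte offset, a 5-digit generation number, the keyword n for an in-use object (f marks a free one), and a two-byte end-of-line, for 20 bytes per entry. The sketch assumes you pass the offsets already ordered by object number, starting at object 1.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

public static class XrefSketch
{
    // Formats one in-use entry: 10-digit offset, 5-digit generation, 'n',
    // then space + LF as the two-byte end-of-line (20 bytes in total).
    public static string FormatInUseEntry(long offset, int generation = 0)
    {
        return string.Format(CultureInfo.InvariantCulture,
            "{0:D10} {1:D5} n \n", offset, generation);
    }

    // Builds the table for objects 1..N, where offsetsByObjectNumber[i]
    // is the byte offset at which object (i + 1) begins in the file.
    public static string BuildTable(IReadOnlyList<long> offsetsByObjectNumber)
    {
        var sb = new StringBuilder();
        int entryCount = offsetsByObjectNumber.Count + 1; // plus object 0
        sb.Append("xref\n");
        sb.Append("0 ").Append(entryCount.ToString(CultureInfo.InvariantCulture)).Append('\n');
        sb.Append("0000000000 65535 f \n"); // object 0 heads the free list
        foreach (long offset in offsetsByObjectNumber)
        {
            sb.Append(FormatInUseEntry(offset));
        }
        return sb.ToString();
    }
}
```

    With something like this, the hand-padded "000000" prefixes are no longer needed. The number written after startxref should be the byte offset of the xref keyword itself, so you would capture b.BaseStream.Position just before writing the table and write that value on its own line ahead of %%EOF.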