c# pdf jdf

How to convert a JDF file to a PDF (Removing text from a multi-encoded document)


I am trying to convert a JDF file to a PDF file using C#.

After looking at the JDF file, I can see that it is simply XML placed at the top of a PDF document.

I've tried using StreamReader / StreamWriter in C#, but because the PDF part also contains binary data and mixed line endings (\r\n and \n), the file produced cannot be opened: some of the binary data in the PDF is destroyed. Here is some of the code I tried, without success.

using (StreamReader reader = new StreamReader(_jdf.FullName, Encoding.Default))
{
    using (StreamWriter writer = new StreamWriter(_pdf.FullName, false, Encoding.Default))
    {

        writer.NewLine = "\n"; //Tried without this and with \r\n

        bool IsStartOfPDF = false;
        while (!reader.EndOfStream)
        {
            var line = reader.ReadLine();

            if (line.IndexOf("%PDF-") != -1)
            {
                IsStartOfPDF = true;
            }

            if (!IsStartOfPDF)
            {
                continue;
            }

            writer.WriteLine(line);
        }
    }
}

Solution

  • I am answering my own question, as this may be a fairly common problem and the solution could be useful to others.

    As the document contains both binary data and text, we cannot simply use StreamWriter to write the binary part to another file. Even if you use StreamReader to read a file and StreamWriter to write all of its contents to another file, you will notice differences between the two documents, because the bytes are decoded to text on the way in and re-encoded on the way out.
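    To make the point above concrete, here is a minimal sketch of what a text round trip does to non-text bytes. The class and method names (RoundTripDemo, TextRoundTrip, LineRoundTrip) are made up for illustration, and the sample bytes are just an example of high-bit bytes of the kind found in a PDF, not taken from a real JDF file.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;

public static class RoundTripDemo
{
    // Decodes bytes to a string and encodes the string back to bytes,
    // which is what a StreamReader/StreamWriter pair does under the hood.
    public static byte[] TextRoundTrip(byte[] input, Encoding encoding)
    {
        string text = encoding.GetString(input);
        return encoding.GetBytes(text);
    }

    // Reads text line by line and writes every line back with "\n",
    // mirroring a ReadLine()/WriteLine() loop with NewLine = "\n".
    public static string LineRoundTrip(string input)
    {
        var sb = new StringBuilder();
        using (var reader = new StringReader(input))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                sb.Append(line).Append('\n');
            }
        }
        return sb.ToString();
    }

    public static void Main()
    {
        // "%PDF" followed by four bytes that are not valid UTF-8.
        byte[] original = { 0x25, 0x50, 0x44, 0x46, 0xE2, 0xE3, 0xCF, 0xD3 };
        byte[] viaUtf8 = TextRoundTrip(original, Encoding.UTF8);
        Console.WriteLine(original.SequenceEqual(viaUtf8)); // False

        // A lone \r and a \r\n both come back as \n.
        Console.WriteLine(LineRoundTrip("a\rb\r\nc") == "a\nb\nc\n"); // True
    }
}
```

    The invalid bytes are replaced during decoding and every line ending is normalized, so the copy no longer matches the original byte for byte.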

    You can use BinaryReader and BinaryWriter to search a multi-part document and write each byte, exactly as you found it, into another file.

    //Using a Binary Reader/Writer as the PDF is multitype
    using (var reader = new BinaryReader(File.Open(_file.FullName, FileMode.Open)))
    {
        using (var writer = new BinaryWriter(File.Open(tempFileName.FullName, FileMode.CreateNew)))
        {
    
            //We are searching for the start of the PDF 
            bool searchingForstartOfPDF = true;
            var startOfPDF = "%PDF-".ToCharArray();
    
            //While we haven't reached the end of the stream
            while (reader.BaseStream.Position != reader.BaseStream.Length)
            {
                //If we are still searching for the start of the PDF
                if (searchingForstartOfPDF)
                {
                    //Read the current Char
                    var str = reader.ReadChar();
    
                    //If it matches the start of the PDF signature
                    if (str.Equals(startOfPDF[0]))
                    {
                        //Check the next few characters to see if they match
                        //keeping an eye on our current position in the stream in case something goes wrong
                        var currBasePos = reader.BaseStream.Position;
                        for (var i = 1; i < startOfPDF.Length; i++)
                        {
                            //If we found a char that isn't in the PDF signature, then resume the while loop
                            //to start searching again from the next position
                            if (!reader.ReadChar().Equals(startOfPDF[i]))
                            {
                                reader.BaseStream.Position = currBasePos;
                                break;
                            }
                            //If we've reached the end of the PDF signature then we've found a match
                            if (i == startOfPDF.Length - 1)
                            {
                                //Success
                                //Set the Position to the start of the PDF signature
                                searchingForstartOfPDF = false;
                                reader.BaseStream.Position -= startOfPDF.Length;
                                //We are no longer searching for the PDF signature, so
                                //the remaining bytes in the file will be written directly
                                //using the binary writer
                            }
                        }
                    }
                }
                else
                {
                    //We are writing the binary now
                    writer.Write(reader.ReadByte());
                }
            }
    
        }
    }
    

    This code example uses the BinaryReader to read one char at a time. When it finds the string %PDF- (the PDF start signature), it moves the reader position back to the % and then writes the rest of the document byte by byte using writer.Write(reader.ReadByte()).
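    For files that fit comfortably in memory, the same idea can be sketched more compactly by searching the raw bytes instead of decoded chars. This is only an illustrative alternative under that assumption, not a replacement for the code above; JdfPdfExtractor, IndexOf and ExtractPdf are hypothetical names introduced here.

```csharp
using System;
using System.IO;
using System.Text;

public static class JdfPdfExtractor
{
    // The PDF start signature as raw bytes.
    private static readonly byte[] PdfSignature = Encoding.ASCII.GetBytes("%PDF-");

    // Returns the index of the first occurrence of pattern in data, or -1.
    public static int IndexOf(byte[] data, byte[] pattern)
    {
        for (int i = 0; i <= data.Length - pattern.Length; i++)
        {
            int j = 0;
            while (j < pattern.Length && data[i + j] == pattern[j])
            {
                j++;
            }
            if (j == pattern.Length)
            {
                return i;
            }
        }
        return -1;
    }

    // Copies everything from the first "%PDF-" onward into pdfPath.
    // Returns false, and writes nothing, when no signature is found.
    public static bool ExtractPdf(string jdfPath, string pdfPath)
    {
        byte[] data = File.ReadAllBytes(jdfPath);
        int start = IndexOf(data, PdfSignature);
        if (start < 0)
        {
            return false;
        }

        using (FileStream output = File.Create(pdfPath))
        {
            // Bytes are written untouched: no decoding, no newline handling.
            output.Write(data, start, data.Length - start);
        }
        return true;
    }
}
```

    A call would look like JdfPdfExtractor.ExtractPdf("input.jdf", "output.pdf"), with both paths being placeholders. Like the reader/writer loop, this copies everything after the signature, so anything the container appends after the PDF ends up in the output as well.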