Search code examples
c#pdfitextpdfboxembedded-fonts

Issues Decoding Flate from PDF Embedded Font


Ok, before we start. I work for a company that has a license to redistribute PDF files from various publishers in any media form. So, that being said, the extraction of embedded fonts from the given PDF files is not only legal - but also vital to the presentation.

I am using code found on this site, however I do not recall the author, when I find it I will reference them. I have located the stream within the PDF file that contains the embedded fonts, I have isolated this encoded stream as a string and then into a byte[]. When I use the following code I get an error

Block length does not match with its complement.

Code (the error occurs in the while line below):

private static byte[] DecodeFlateDecodeData(byte[] data)
{
    MemoryStream outputStream;
    using (outputStream = new MemoryStream())
    {
        using (var compressedDataStream = new MemoryStream(data))
        {
            // Remove the first two bytes to skip the header (it isn't recognized by the DeflateStream class)
            compressedDataStream.ReadByte();
            compressedDataStream.ReadByte();

            var deflateStream = new DeflateStream(compressedDataStream, CompressionMode.Decompress, true);
            var decompressedBuffer = new byte[compressedDataStream.Length];
            int read;

            // The error occurs in the following line
            while ((read = deflateStream.Read(decompressedBuffer, 0, decompressedBuffer.Length)) != 0)
            {
                outputStream.Write(decompressedBuffer, 0, read);
            }
            outputStream.Flush();
            compressedDataStream.Close();
        }

        return ReadFully(outputStream);
    }
}

After using the usual tools (Google, Bing, archives here) I found that the majority of the time that this occurs is when one has not consumed the first two bytes of the encoding stream - but this is done here so i cannot find the source of this error. Below is the encoded stream:

H‰LT}lg?7ñù¤aŽÂ½ãnÕ´jh›Ú?-T’ÑRL–¦
ëš:Uí6Ÿ¶“ø+ñ÷ùü™”ÒÆŸŸíóWlDZ“ºu“°tƒ¦t0ÊD¶jˆ
Ö   m:$½×^*qABBï?Þç÷|ýÞßóJÖˆD"yâP—òpgÇó¦Q¾S¯9£Û¾mçÁçÚ„cÂÛO¡É‡·¥ï~á³ÇãO¡ŸØö=öPD"d‚ìA—$H'‚DC¢D®¤·éC'Å:È—€ìEV%cÿŽS;þÔ’kYkùcË_ZÇZ/·þYº(ý݇Ã_ó3m¤[3¤²4ÿo?²õñÖ*Z/Þiãÿ¿¾õ8Ü    ?»„O Ê£ðÅ­P9ÿ•¿Â¯*–z×No˜0ãÆ-êàîoR‹×ÉêÊêÂulaƒÝü

Please help, I am beating my head against the wall here!

NOTE: The stream above is the encoded version of Arial Black - according to the specs inside the PDF:

661 0 obj
<< 
/Type /FontDescriptor 
/FontFile3 662 0 R 
/FontBBox [ -194 -307 1688 1083 ] 
/FontName /HLJOBA+ArialBlack 
/Flags 4 
/StemV 0 
/CapHeight 715 
/XHeight 518 
/Ascent 0 
/Descent -209 
/ItalicAngle 0 
/CharSet (/space/T/e/s/t/a/k/i/n/g/S/r/E/x/m/O/u/l)
>> 
endobj
662 0 obj
<< /Length 1700 /Filter /FlateDecode /Subtype /Type1C >> 
stream
H‰LT}lg?7ñù¤aŽÂ½ãnÕ´jh›Ú?-T’ÑRL–¦
ëš:Uí6Ÿ¶“ø+ñ÷ùü™”ÒÆŸŸíóWlDZ“ºu“°tƒ¦t0ÊD¶jˆ
Ö   m:$½×^*qABBï?Þç÷|ýÞßóJÖˆD"yâP—òpgÇó¦Q¾S¯9£Û¾mçÁçÚ„cÂÛO¡É‡·¥ï~á³ÇãO¡ŸØö=öPD"d‚ìA—$H'‚DC¢D®¤·éC'Å:È—€ìEV%cÿŽS;þÔ’kYkùcË_ZÇZ/·þYº(ý݇Ã_ó3m¤[3¤²4ÿo?²õñÖ*Z/Þiãÿ¿¾õ8Ü    ?»„O Ê£ðÅ­P9ÿ•¿Â¯*–z×No˜0ãÆ-êàîoR‹×ÉêÊêÂulaƒÝü

Solution

  • Okay, for anyone who might stumble across this issue themselves allow me to warn you - this is a rocky road without a great deal of good solutions. I eventually moved away from writing all of the code to extract the fonts myself. I simply downloaded MuPDF (open source) and then made command line calls to mutool.exe:

        mutool extract C:\mypdf.pdf
    

    This pulls all of the fonts into the folder mutool resides in (it also extracts some images (these are the fonts that could not be converted (usually small subsets I think))). I then wrote a method to move those from that folder into the one I wanted them in.

    Of course, to convert these to anything usable is a headache in itself - but I have found it to be doable.

    As a reminder, font piracy IS piracy.