Create/Define/Add new encoding


I'm having a problem where, reading a PDF with iText, I get:

"'StandardEncoding' is not a supported encoding name."

The PDF is fine - it's just text. With iText I can load each page into a PdfPage and go through the document page by page, but when I call PdfTextExtractor.GetTextFromPage to read a page into a string, I get this error. That function doesn't seem to offer a way to specify which encoding to read with, and the error message also points me to the documentation for the Encoding.RegisterProvider method.

Looking at Encoding.RegisterProvider, the only provider I can find to pass it is CodePagesEncodingProvider.Instance. Registering that does not add StandardEncoding, the encoding defined in ISO 32000-1. Even if I knew what went into creating an EncodingProvider, it won't let me create a new instance of it.
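The lookup failure can be reproduced without iText. This is a minimal sketch, assuming .NET Core 3.0 or later (where CodePagesEncodingProvider is available without an extra package) and assuming iText ends up calling Encoding.GetEncoding with the name from the error message. EncodingLookupDemo and IsSupported are illustrative names, not part of any library.

```csharp
using System;
using System.Text;

public static class EncodingLookupDemo
{
    // Returns true when the runtime can resolve the given encoding name.
    public static bool IsSupported(string name)
    {
        try
        {
            Encoding.GetEncoding(name);
            return true;
        }
        catch (ArgumentException)
        {
            // Encoding.GetEncoding throws ArgumentException for unknown names.
            return false;
        }
    }

    public static void Main()
    {
        // Adds the legacy Windows/ISO code pages, but nothing PDF-specific.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        Console.WriteLine(IsSupported("windows-1252"));     // prints "True"
        Console.WriteLine(IsSupported("StandardEncoding")); // prints "False"
    }
}
```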

Another Stack Overflow answer says to create a new encoding with the name StandardEncoding. I can't figure out any way to do this. If I create an Encoding, I get read-only WebName, BodyName, HeaderName, and EncodingName properties, nothing I can write to.

Every character in the document has the same character code in StandardEncoding and WinAnsiEncoding, so if all else fails, I could theoretically pretend it's Windows-1252. That would be a last resort, but even that doesn't work: if I try to convert it as 1252, I get the same error about StandardEncoding not being supported. Using GetContentBytes instead of GetTextFromPage doesn't seem to get me anything useful, I assume because it returns the entire page's content, not just its text.

I thought this would be a lot simpler since I'm just dealing with a very simple PDF, but this is surprisingly difficult to even search for - I've been reading the same pages over and over. Has anyone had this issue before and figured it out?


Solution

  • You can use Encoding.RegisterProvider to register a custom provider:

    Encoding.RegisterProvider(StandardEncodingProvider.Instance);
    

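    Here's an example implementation. I'm not sure which encoding should be used, so I use ISO-8859-1; the two are not identical, so check the output for any characters they don't share. For every other name or code page, it returns null so that the lookup falls through to the other registered providers and the built-in encodings.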

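    using System;
    using System.Text;

    public sealed class StandardEncodingProvider : EncodingProvider
    {
        public static readonly StandardEncodingProvider Instance = new StandardEncodingProvider();

        // Stand-in for the PDF StandardEncoding; ISO-8859-1 is a guess, not an exact match.
        public static readonly Encoding StandardEncoding = Encoding.GetEncoding("ISO-8859-1");

        public override Encoding GetEncoding(string name)
        {
            // Only answer for the name that the error message reports as unsupported.
            if (StringComparer.OrdinalIgnoreCase.Equals(name, "StandardEncoding"))
            {
                return StandardEncoding;
            }
            return null; // defer to other providers and the built-in encodings
        }

        public override Encoding GetEncoding(int codepage)
        {
            return null; // no code page is handled by this provider
        }
    }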
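
To check that the registration takes effect, something like the following could be run before any iText call. It is a sketch under the same assumption as above (ISO-8859-1 standing in for StandardEncoding) and repeats a compact copy of the provider only so that the snippet compiles on its own; the Program class is illustrative.

```csharp
using System;
using System.Text;

// Compact copy of the provider so this snippet is self-contained.
public sealed class StandardEncodingProvider : EncodingProvider
{
    public static readonly StandardEncodingProvider Instance = new StandardEncodingProvider();

    // Assumption: ISO-8859-1 is used as a substitute for the PDF StandardEncoding.
    private static readonly Encoding Substitute = Encoding.GetEncoding("ISO-8859-1");

    public override Encoding GetEncoding(string name) =>
        StringComparer.OrdinalIgnoreCase.Equals(name, "StandardEncoding") ? Substitute : null;

    public override Encoding GetEncoding(int codepage) => null;
}

public static class Program
{
    public static void Main()
    {
        // Register once at startup, before the first text extraction.
        Encoding.RegisterProvider(StandardEncodingProvider.Instance);

        // Without the registration this lookup throws ArgumentException.
        Encoding enc = Encoding.GetEncoding("StandardEncoding");

        // Round-trip plain ASCII text through the resolved encoding.
        byte[] bytes = enc.GetBytes("Hello, PDF");
        Console.WriteLine(enc.GetString(bytes)); // prints "Hello, PDF"
    }
}
```

Once the lookup succeeds here, the call to PdfTextExtractor.GetTextFromPage should no longer fail on the encoding name, although whether the extracted characters are correct still depends on how closely the substitute encoding matches the document.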