Tags: delphi, utf-8, text-files, delphi-xe2

TFile.ReadAllText with TEncoding.UTF8 omits first 3 chars


I have a UTF-8 text file which starts with this line:

<HEAD><META name=GENERATOR content="MSHTML 10.00.9200.16521"><body>

When I read this file with TFile.ReadAllText with TEncoding.UTF8:

MyStr := TFile.ReadAllText(ThisFileNamePath, TEncoding.UTF8);

then the first 3 characters of the text file are omitted, and MyStr contains:

'AD><META name=GENERATOR content="MSHTML 10.00.9200.16521"><body>...'

However, when I read this file with TFile.ReadAllText without TEncoding.UTF8:

MyStr := TFile.ReadAllText(ThisFileNamePath);

then the file is read completely and correctly:

<HEAD><META name=GENERATOR content="MSHTML 10.00.9200.16521"><body>...

Does TFile.ReadAllText have a bug?


Solution

  • The first three bytes are skipped because the RTL code assumes that the file contains a UTF-8 BOM. Clearly your file does not.

    The TUTF8Encoding class implements a GetPreamble method that returns the UTF-8 BOM, and ReadAllText skips the preamble specified by the encoding that you pass.
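
    To make the preamble concrete, here is a minimal console sketch that prints the bytes returned by TEncoding.UTF8.GetPreamble. The program name is made up for illustration, and the dotted unit name assumes XE2 or later. The three preamble bytes match the three characters that go missing in the question.

    ```delphi
    program ShowUtf8Preamble;

    {$APPTYPE CONSOLE}

    uses
      System.SysUtils;

    var
      Preamble: TBytes;
      B: Byte;
      Hex: string;
    begin
      // The preamble of TEncoding.UTF8 is the three-byte UTF-8 BOM.
      Preamble := TEncoding.UTF8.GetPreamble;
      Hex := '';
      for B in Preamble do
        Hex := Hex + IntToHex(B, 2) + ' ';
      Writeln(Length(Preamble), ' bytes: ', Trim(Hex)); // 3 bytes: EF BB BF
    end.
    ```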

    One simple solution would be to read the file into a byte array and then use TEncoding.UTF8.GetString to decode it into a string.

    var
      Bytes: TBytes;
      Str: string;
    ....
    Bytes := TFile.ReadAllBytes(FileName);
    Str := TEncoding.UTF8.GetString(Bytes);
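
    If some of your files start with a BOM and others do not, the same approach can be extended to skip the BOM only when it is actually present. The following is a sketch under that assumption. ReadUtf8Text is a hypothetical helper name, not an RTL routine, and the dotted unit names assume XE2 or later.

    ```delphi
    program ReadUtf8Demo;

    {$APPTYPE CONSOLE}

    uses
      System.SysUtils, System.IOUtils;

    // Hypothetical helper: decodes a file as UTF-8, with or without a BOM.
    function ReadUtf8Text(const FileName: string): string;
    var
      Bytes, Bom: TBytes;
      Offset: Integer;
    begin
      Bytes := TFile.ReadAllBytes(FileName);
      Bom := TEncoding.UTF8.GetPreamble;
      Offset := 0;
      // Skip the BOM only when the file really starts with it.
      if (Length(Bom) > 0) and (Length(Bytes) >= Length(Bom)) and
         CompareMem(@Bytes[0], @Bom[0], Length(Bom)) then
        Offset := Length(Bom);
      // Guard the empty case so GetString never sees a zero-length range.
      if Offset < Length(Bytes) then
        Result := TEncoding.UTF8.GetString(Bytes, Offset, Length(Bytes) - Offset)
      else
        Result := '';
    end;

    begin
      // Pass the file to read as the first command-line argument.
      if ParamCount > 0 then
        Writeln(ReadUtf8Text(ParamStr(1)));
    end.
    ```

    Unlike the plain two-line version, this does not leave a leading U+FEFF character in the string when the file does carry a BOM.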
    

    A more comprehensive alternative would be to make a TEncoding instance that ignores the UTF-8 BOM.

    type
      TUTF8EncodingWithoutBOM = class(TUTF8Encoding)
      public
        function Clone: TEncoding; override;
        function GetPreamble: TBytes; override;
      end;
    
    function TUTF8EncodingWithoutBOM.Clone: TEncoding;
    begin
      Result := TUTF8EncodingWithoutBOM.Create;
    end;
    
    function TUTF8EncodingWithoutBOM.GetPreamble: TBytes;
    begin
      Result := nil;
    end;
    

    Instantiate one of these (you only need one instance per process) and pass it to TFile.ReadAllText.

    The advantage of using a singleton instance of TUTF8EncodingWithoutBOM is that you can use it anywhere that expects a TEncoding.
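
    As a rough illustration of that last step, the sketch below creates one instance and passes it to TFile.ReadAllText. The variable name UTF8NoBOM and the program name are made up for this example, the class is repeated only so that the program compiles on its own, and the dotted unit names assume XE2 or later.

    ```delphi
    program NoBomEncodingDemo;

    {$APPTYPE CONSOLE}

    uses
      System.SysUtils, System.IOUtils;

    type
      TUTF8EncodingWithoutBOM = class(TUTF8Encoding)
      public
        function Clone: TEncoding; override;
        function GetPreamble: TBytes; override;
      end;

    function TUTF8EncodingWithoutBOM.Clone: TEncoding;
    begin
      Result := TUTF8EncodingWithoutBOM.Create;
    end;

    function TUTF8EncodingWithoutBOM.GetPreamble: TBytes;
    begin
      Result := nil; // empty preamble, so there is nothing to skip
    end;

    var
      UTF8NoBOM: TEncoding; // hypothetical name for the shared instance

    begin
      UTF8NoBOM := TUTF8EncodingWithoutBOM.Create;
      try
        // Pass the file to read as the first command-line argument.
        if ParamCount > 0 then
          Writeln(TFile.ReadAllText(ParamStr(1), UTF8NoBOM));
      finally
        UTF8NoBOM.Free;
      end;
    end.
    ```

    In a larger application the instance would more likely be created in a unit's initialization section and freed in its finalization section. Because this encoding reports no preamble, a file that does begin with a BOM should be expected to yield a string with a leading U+FEFF character.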