
How to get the real file contents using TFileStream?


I am trying to read a file's contents using TFileStream:

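procedure ShowFileCont(const myfile: string);
var
  tr: string;
  fs: TFileStream;
begin
  fs := TFileStream.Create(myfile, fmOpenRead or fmShareDenyNone);
  try
    // Size the string so it can hold every byte of the file.
    SetLength(tr, fs.Size);
    // Skip the read for an empty file, where tr[1] would be out of range.
    if Length(tr) > 0 then
      fs.ReadBuffer(tr[1], Length(tr));
    ShowMessage(tr);
  finally
    fs.Free; // release the file handle even if reading fails
  end;
end;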

I made a small text file containing only: aaaaaaaJ“њРЉTщЂ®8ЈЏVд"Ј¦AИaaaaaaa

  1. I saved this file (using AkelPad) with the 1251 (ANSI) code page.
  2. I saved it again with the 65001 (UTF-8) code page.

These two files have different sizes, but their contents are identical: I opened both in Notepad and they show the same text.

But when I run the ShowFileCont procedure, it shows different results:

  1. aaaaaaaJ?ЊT?8?V?"?A?aaaaaaa
  2. aaaaaaaJ“њРЉTщЂ®8ЈЏVд"Ј¦AИaaaaaaa

Questions:

  1. How do I get the real file contents using TFileStream?
  2. How can these two files have different sizes when their contents (in Notepad) are equal?

Addendum: Sorry, I forgot to mention that I use Lazarus/FPC, where string = UTF8String.


Solution

  • Why do the files have different sizes?

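    Because they use different encodings. The 1251 encoding maps each character to a single byte. UTF-8 uses a variable number of bytes per character: one byte for ASCII characters such as "a", and two or more bytes for everything else, including the Cyrillic letters in your file.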
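
To make the size difference concrete, here is a minimal sketch that counts bytes before and after conversion. It assumes the LConvEncoding unit mentioned later in this answer is available and that the literal bytes are Windows-1251; the program name and variable names are made up for illustration.

```pascal
program EncodingSizes;
{$mode objfpc}{$H+}
uses
  LConvEncoding; // ships with Lazarus

var
  ansi, utf8: string;
begin
  // 'a' followed by Cyrillic 'щ', written as raw Windows-1251 bytes.
  ansi := #$61#$F9;
  // Re-encode the same two characters as UTF-8.
  utf8 := CP1251ToUTF8(ansi);
  WriteLn('CP1251 bytes: ', Length(ansi));
  WriteLn('UTF-8 bytes:  ', Length(utf8));
end.
```

The text is the same in both strings, but the UTF-8 count comes out larger because щ needs two bytes there instead of one.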

    How do I get the true file contents?

    You need to use a string type that matches the encoding used in the file. For example, if the content is UTF-8 encoded, which is the best choice, load it into a UTF-8 string. You are using FPC in a mode where string is UTF-8 encoded, so for the UTF-8 file the code in the question is what you need.

    Loading a file encoded with a legacy (ANSI/MBCS) code page such as 1251 is trickier. You can load it into an AnsiString variable, and as long as your system's locale is 1251, any conversions will be performed correctly.

    But that code will behave differently when run on a machine with a different locale. And if you want to load text in other legacy encodings, for example 1252, you cannot use this approach at all. Instead, load the file into a byte array and convert explicitly from the known encoding (1252, say) to UTF-8, so that you can store the resulting UTF-8 in a string variable.

    To do that, you can use the LConvEncoding unit that ships with Lazarus (originally part of the LCL; in current releases it lives in the LazUtils package). For example, CP1251ToUTF8, CP1252ToUTF8 and so on convert from a legacy code page to UTF-8.
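
The load-then-convert approach could look roughly like the sketch below. ReadRawBytes and LoadCP1251AsUTF8 are hypothetical helper names, not part of any library, and the sketch assumes you already know the file is 1251 encoded.

```pascal
program LoadCp1251;
{$mode objfpc}{$H+}
uses
  Classes, SysUtils, LConvEncoding;

// Hypothetical helper: reads the whole file as raw bytes, with no conversion.
function ReadRawBytes(const FileName: string): string;
var
  fs: TFileStream;
begin
  Result := '';
  fs := TFileStream.Create(FileName, fmOpenRead or fmShareDenyNone);
  try
    SetLength(Result, fs.Size);
    // Guard against empty files, where Result[1] would be out of range.
    if Length(Result) > 0 then
      fs.ReadBuffer(Result[1], Length(Result));
  finally
    fs.Free;
  end;
end;

// Hypothetical helper: decodes a file assumed to be Windows-1251 into UTF-8.
function LoadCP1251AsUTF8(const FileName: string): string;
begin
  Result := CP1251ToUTF8(ReadRawBytes(FileName));
end;

begin
  // Usage: pass the path of a 1251 encoded file as the first argument.
  if ParamCount >= 1 then
    WriteLn('UTF-8 length in bytes: ', Length(LoadCP1251AsUTF8(ParamStr(1))));
end.
```

Because the conversion names its source encoding explicitly, the result does not depend on the locale of the machine running the code. In a GUI application you would pass the returned string to ShowMessage instead of printing its length.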

    How can I determine from the file what encoding is used?

    You cannot. You can make a guess that will be accurate in many cases. But in general, it is simply impossible to identify the encoding of an array of bytes that is meant to represent text.

    It is sometimes possible to take a file and rule out certain encodings. For example, not all byte streams are valid UTF-8 or UTF-16 text, so a file that fails validation cannot be in that encoding. But for single-byte encodings like 1251 and 1252, practically any byte stream is valid. There is simply no way to tell 1251 encoded streams apart from 1252 encoded streams with 100% accuracy.
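
As an illustration of ruling out an encoding, the sketch below checks whether a byte string is well-formed UTF-8. It assumes a Lazarus version whose LazUTF8 unit provides FindInvalidUTF8Codepoint (older releases used the name FindInvalidUTF8Character); IsValidUTF8 is a hypothetical helper name.

```pascal
program CheckUtf8;
{$mode objfpc}{$H+}
uses
  LazUTF8; // LazUtils package

// Hypothetical helper: True when Raw is a well-formed UTF-8 byte sequence.
function IsValidUTF8(const Raw: string): Boolean;
begin
  // FindInvalidUTF8Codepoint returns -1 when it finds no invalid sequence.
  Result := FindInvalidUTF8Codepoint(PChar(Raw), Length(Raw)) = -1;
end;

begin
  WriteLn(IsValidUTF8('abc'));     // pure ASCII is also valid UTF-8
  WriteLn(IsValidUTF8(#$61#$F9));  // a lone $F9 byte is not valid UTF-8
end.
```

A failed check tells you the file is not UTF-8. A passed check only tells you UTF-8 is possible, since short or ASCII-only text passes under many encodings.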

    The LConvEncoding unit has a GuessEncoding function, which may be of some use, but by its nature it returns a guess rather than a guarantee.
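
A best-effort conversion built on GuessEncoding might be sketched as follows. RawToUTF8 is a hypothetical helper name; GuessEncoding, ConvertEncoding and the EncodingUTF8 constant are taken from LConvEncoding. As far as I understand, for data that is not valid UTF-8 and has no byte order mark the guess falls back to the system's default encoding, so treat that detail as an assumption to verify against your Lazarus version.

```pascal
program GuessAndConvert;
{$mode objfpc}{$H+}
uses
  LConvEncoding;

// Hypothetical helper: best-effort conversion of raw file bytes to UTF-8.
function RawToUTF8(const Raw: string): string;
var
  enc: string;
begin
  // Heuristic only: the returned encoding name may be wrong.
  enc := GuessEncoding(Raw);
  Result := ConvertEncoding(Raw, enc, EncodingUTF8);
end;

begin
  // Print the guessed encoding name for a plain ASCII sample.
  WriteLn(GuessEncoding('plain ascii'));
  WriteLn(RawToUTF8('plain ascii'));
end.
```

When you know the file's real encoding, calling the specific converter (CP1251ToUTF8 and friends) is more reliable than guessing.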