
When is it safe to cast UnicodeString to string in Free Pascal 3?


This unit test runs successfully with Free Pascal 3.0 in Delphi mode:

procedure TFreePascalTests.TestUTF8Decode;
var
  Raw: RawByteString;
  Actual: string;
begin
  Raw := UTF8Encode('关于汉语');

  Actual := string( UTF8Decode(Raw) ); // <--- cast from UnicodeString

  CheckEquals('关于汉语', Actual);

  // check Windows ANSI code page 
  CheckEquals(1252, GetACP);
  // check Free Pascal value (determines how CP_ACP is interpreted)
  CheckEquals(65001, DefaultSystemCodePage); 
end; 

UTF8Decode returns a UnicodeString. Without the hard type cast to string, the compiler warns about an unsafe conversion:

Warning: Implicit string type conversion with potential data loss from "UnicodeString" to "AnsiString"

(tested with Lazarus 1.6 / FPCUnit GUITestrunner)

As per http://wiki.freepascal.org/Character_and_string_types#String, the string type defaults to AnsiString (if the {$H+} switch is set to use AnsiString instead of ShortString).

It looks like Free Pascal stores the Unicode text in the AnsiString variable without loss: the test succeeds even without the cast.

Question: since the test succeeds, can I assume that the cast (used to suppress the warning) is safe and carries no risk of data loss?


Solution

  • The cast is not safe in general. You are still converting the UnicodeString into an AnsiString, and the encoding of an AnsiString is not known at compile time. The warning only goes away because the conversion is now explicit, so the compiler assumes you know what you are doing.

    Whether the cast works depends on the encoding in effect at run time. Either that encoding is UTF-8, in which case Actual holds the text UTF-8 encoded and nothing is lost, or the code page of your system's locale happens to support the characters you are using. If you run this code where the encoding is, e.g., CP1250, it will fail, because CP1250 cannot represent Chinese characters. The governing variable is DefaultSystemCodePage. On startup the FPC RTL initializes it from the system encoding. However, some frameworks (like the LCL) override this and set it to, e.g., UTF-8. This matches the test in the question, where DefaultSystemCodePage is 65001 even though GetACP reports 1252.

    Use {$modeswitch unicodestrings} in addition to {$mode delphi}. Then string is an alias for UnicodeString, so the result no longer depends on the locale.
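
As a rough illustration of the modeswitch suggestion, here is a minimal sketch, assuming FPC 3.0 or later. The program name and the variable Expected are made up for this example, and the text is built from UTF-16 code units instead of a literal so that the sketch does not depend on the source file encoding. With the modeswitch active, the result of UTF8Decode is assigned to a UnicodeString variable, so there is no conversion to AnsiString and nothing for a cast to hide.

```pascal
program UnicodeStringsSketch;

{$mode delphi}
{$modeswitch unicodestrings} // must follow {$mode}; string is now UnicodeString

var
  Expected, Actual: string;  // both are UnicodeString under the modeswitch
  Raw: RawByteString;
begin
  // U+5173 U+4E8E U+6C49 U+8BED, the text used in the question
  Expected := #$5173#$4E8E#$6C49#$8BED;

  Raw := UTF8Encode(Expected);
  // UnicodeString assigned to UnicodeString: no cast needed and
  // DefaultSystemCodePage plays no role in this assignment
  Actual := UTF8Decode(Raw);

  if Actual = Expected then
    WriteLn('round trip preserved the text')
  else
    WriteLn('round trip changed the text');
end.
```

The trade-off is that every string in the unit becomes UTF-16, so code that passes strings to APIs expecting AnsiString will convert at those boundaries instead, and those conversions are again subject to DefaultSystemCodePage.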