Delphi 2009 RawByteString vagaries


Suppose that for some perverse reason you want to display the raw byte contents of a UTF8String.

var
  utf8Str : UTF8String;
begin    
  utf8Str := '€ąćęłńóśźż';
end;

(1) This doesn't do it; it displays the readable form:

memo1.Lines.Add( RawByteString( utf8Str ));
// output: '€ąćęłńóśźż'

(2) This, however, does "work" - note the concatenation:

memo1.Lines.Add( 'x' + RawByteString( utf8Str ));
// output: 'x' followed by the UTF-8 bytes shown one character per byte, not the readable '€ąćęłńóśźż'

I understand (1), though the compiler's forced coercion to UnicodeString seems to prevent ever displaying a RawByteString variable as-is. However, why does the behavior change in (2)?

(3) Stranger still - let's reverse the concatenation:

memo1.Lines.Add( RawByteString( utf8Str ) + 'x' ); 
// output: '€ąćęłńóśźżx'

I've been reading up on the newfangled string types in Delphi and thought I understood how they work, but this is a puzzle.


Solution

  • RawByteString only exists to minimize the number of overloads required for functions that work with various flavours of AnsiStrings with different codepage affinities.

    In general, don't declare variables of type RawByteString. Don't typecast values to that type. Don't do concatenations on variables of that type. About the only things you can do are:

    • Declaring a parameter of this type (the original intent)
    • Indexing on such a parameter
    • Searching in such a parameter
    • Intelligent operations that check the actual code page of the string, using the StringCodePage function.
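
    As a minimal sketch of the first two items, the helper below takes a RawByteString parameter and indexes it byte by byte. The name `BytesToHex` and the surrounding console program are hypothetical, chosen only for illustration. Because the parameter is a RawByteString, a UTF8String argument should be passed without a code page conversion, so the loop sees the stored UTF-8 bytes.

    ```pascal
    program HexDumpDemo;

    {$APPTYPE CONSOLE}

    uses
      SysUtils;

    // Hypothetical helper: returns the bytes of any AnsiString flavour as hex.
    // Declaring the parameter as RawByteString is the intended use of the type.
    function BytesToHex(const S: RawByteString): string;
    var
      I: Integer;
    begin
      Result := '';
      for I := 1 to Length(S) do
      begin
        if I > 1 then
          Result := Result + ' ';
        // Indexing yields an AnsiChar; Ord gives its byte value (0..255).
        Result := Result + IntToHex(Ord(S[I]), 2);
      end;
    end;

    var
      Utf8Str: UTF8String;
    begin
      // U+20AC (euro sign) and U+0105 (a with ogonek), written as character
      // codes so the example does not depend on the source file encoding.
      Utf8Str := #$20AC#$0105;
      Writeln(BytesToHex(Utf8Str));
    end.
    ```

    The UTF-8 encoding of U+20AC is E2 82 AC and that of U+0105 is C4 85, so those are the bytes the dump should contain if no conversion took place. In a form, the same string could be passed to memo1.Lines.Add instead of Writeln.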

    For example, you'll note that the StringCodePage function itself uses RawByteString as its argument type. This way, it will work with any AnsiString, rather than doing a codepage translation before passing it as an argument.
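
    A small sketch of that point, assuming a console program (the program and variable names are illustrative): both variables are handed to StringCodePage directly, and each call reports the code page stored with that particular string rather than a converted copy.

    ```pascal
    program CodePageDemo;

    {$APPTYPE CONSOLE}

    var
      Utf8Str: UTF8String;
      AnsiStr: AnsiString;
    begin
      Utf8Str := 'abc';
      AnsiStr := 'abc';
      // StringCodePage takes a RawByteString, so neither argument is
      // translated to another code page on the way in.
      Writeln('UTF8String code page: ', StringCodePage(Utf8Str));
      Writeln('AnsiString code page: ', StringCodePage(AnsiStr));
    end.
    ```

    UTF8String is declared with code page 65001, so that is the value to expect for the first line; the second depends on the system and compiler settings.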

    For your case, operations such as concatenation are largely undefined. The behaviour changed between RTM and Update 2, but when the RTL string concatenation functions receive multiple strings with different code pages, they have no easy way to figure out which code page should be used for the final string. That's just one reason why you shouldn't concatenate them as you do here.
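
    If the goal is still to see the bytes as characters, one alternative that avoids concatenation is to copy the bytes into a plain AnsiString, so that they are reinterpreted under the default ANSI code page. This is only a sketch under that assumption; `BytesAsAnsi` is a hypothetical helper, not an RTL function.

    ```pascal
    program RelabelDemo;

    {$APPTYPE CONSOLE}

    // Hypothetical helper: copies the bytes of S unchanged into an AnsiString.
    // No code page translation happens because Move copies raw memory.
    function BytesAsAnsi(const S: RawByteString): AnsiString;
    begin
      SetLength(Result, Length(S));
      if Length(S) > 0 then
        Move(S[1], Result[1], Length(S));
    end;

    var
      Utf8Str: UTF8String;
      Bytes: AnsiString;
    begin
      // U+20AC and U+0105 as character codes, independent of file encoding.
      Utf8Str := #$20AC#$0105;
      Bytes := BytesAsAnsi(Utf8Str);
      Writeln(Length(Bytes), ' bytes copied; code page is now ',
        StringCodePage(Bytes));
    end.
    ```

    On a system with a single-byte ANSI code page, passing `Bytes` to memo1.Lines.Add should then show one character per stored byte, which is the effect the question was after, without relying on how concatenation picks a code page.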