How to create a TMemoryStream and then LoadFromStream into a TRichEdit with proper formatting


I have a large string of RTF formatted data. The formatting is sound: put it in an ANSI text file, rename it to *.rtf, and WordPad will display it correctly.

The string is essentially a std::wstring, not a TUnicodeString.

If I do the following, the text is displayed properly with proper color formatting etc:

TStringStream *Stream = new TStringStream(String(MyStr.c_str(), MyStr.size())) ;
Stream->Position = 0 ;
RichEdit1->PlainText = false ;
RichEdit1->Lines->LoadFromStream(Stream) ;
delete Stream ;

All good, it works, but I was thinking of avoiding the memcpy that takes place when the String is created, which would save some resources with particularly huge strings.

My aim was to create a TCustomMemoryStream descendant that takes MyStr as input and uses its internal memory by calling SetPointer((void*)MyStr.c_str(), MyStr.length() * 2 /*Size in bytes*/) during construction.

This saves a memcpy if handled with care (MyStr must outlive the Stream etc.) and it's an easy and quick implementation.

Sadly, it doesn't work properly and I can't figure out why. I have a working solution and could move on, but it bugs me, so please enlighten me.

Implemented slightly differently for testing, but it comes down to the same:

TMemoryStream *Stream = new TMemoryStream() ;
Stream->Write((void*)MyStr.c_str(), MyStr.length() * 2 /*Bytes*/) ;
Stream->Position = 0 ;
RichEdit1->PlainText = false ;
RichEdit1->Lines->LoadFromStream(Stream) ;
delete Stream ;

RichEdit is unable to show the formatted text. Instead it shows the plain text (characters spaced out). I understand this to be a case of not getting the encoding right, that makes sense.

So I tell LoadFromStream() what encoding to use:

TMemoryStream *Stream = new TMemoryStream() ;
Stream->Write((void*)MyStr.c_str(), MyStr.length() * 2 /*Bytes*/) ;
Stream->Position = 0 ;
RichEdit1->PlainText = false ;
RichEdit1->Lines->LoadFromStream(Stream, TEncoding::Unicode) ;
delete Stream ;

The text is shown properly now, but still as plain text: the RTF is not parsed. I don't understand why. The text seems to arrive in its entirety; when I copy it into a text file and compare it to the earlier rtf file, the content is identical.

I figured perhaps the encoding needs a BOM to work properly (since that is the default in TEncoding::Unicode), so I added one for testing:

TMemoryStream *Stream = new TMemoryStream() ;
WORD BOM = 0xFEFF ;
Stream->Write((void*)&BOM, 2) ;
Stream->Write((void*)MyStr.c_str(), MyStr.length() * 2 /*Bytes*/) ;
Stream->Position = 0 ;
RichEdit1->PlainText = false ;
RichEdit1->Lines->LoadFromStream(Stream, TEncoding::Unicode) ;
delete Stream ;

But it doesn't make a difference. So I tried the opposite (passing a TEncoding that doesn't require a BOM):

TMemoryStream *Stream = new TMemoryStream() ;
Stream->Write((void*)MyStr.c_str(), MyStr.length() * 2 /*Bytes*/) ;
Stream->Position = 0 ;
RichEdit1->PlainText = false ;
TUnicodeEncoding *Encoding = new TUnicodeEncoding(false /*UseBOM*/) ;
RichEdit1->Lines->LoadFromStream(Stream, Encoding) ;
delete Encoding ;
delete Stream ;

Sadly, it is still just plain text.

I tried a handful of other things as well in a test app, load in TMemo, Save to stream, load in RichEdit etc. (with various results), I also tried setting an Encoding during TStringStream construction with strange results, but I don't want to clutter this Q with that.

I'd like to understand why TRichEdit is unable to parse the rtf even though it seems to get all data correctly as it displays it in plain text

I'm currently using C++ Builder 12

EDIT 1 - after Remy's input

"IOW, it actually converts the String to the specified (or in this case, defaulted) encoding"

Oh wow, an even bigger penalty I wasn't aware of. I thought it was a means to tell TStringStream what the input encoding is rather than what it needs to be after storage

What I then find very confusing is that when I LoadFromStream but pass Unicode as encoding:

RichEdit1->Lines->LoadFromStream(Stream, TEncoding::Unicode) ;

It still works properly ? So, what is LoadFromStream doing with this encoding then ? Since the Stream has been converted to ANSI and RichEdit itself requires ANSI as well (and since that is what seems to be sent to Windows - because it works) ?

When I try the opposite, no conversion in TStringStream (and hence UTF-16 storage)

TStringStream *Stream = 
    new TStringStream(String(MyStr.c_str(), MyStr.length()), TEncoding::Unicode, true) ;

but supposed conversion in LoadFromStream:

RichEdit1->Lines->LoadFromStream(Stream) ;

or

RichEdit1->Lines->LoadFromStream(Stream, TEncoding::Default) ;

It doesn't work, so LoadFromStream doesn't use the encoding to convert (as is the case during TStringStream construction) ?

And to add to my confusion: you mention that Unicode can't be used as input for LoadFromStream, which would mean that non-Latin characters don't get converted to RTF text (unless the default TConversion takes care of it, which it doesn't; I just checked). Actually, the information was already lost before TConversion could look at it: in the working case the string had been converted to ANSI, so the characters show up as question marks, and when kept as Unicode it didn't work at all anyway.

Which is doubly confusing since adding special characters to a properly displaying rtf doc

RichEdit1->Lines->Add(L"你好") ; // Chinese simplified: Nǐ hǎo

works perfectly fine and all content is correctly preserved.

Does this mean that VCL converts special characters to rtf formatted text in case of Add() ? Or does this mean Windows' RichEdit can take Unicode as input when lines are added, which then makes me wonder if there is not also a unicode version for the in-streaming (and is VCL code not aware of this then) ?

EDIT 2

Following your suggestion to use TPointerStream, and keeping the RichEdit restrictions in mind, I first create an ANSI string (based on std::string), which I then use in the following way:

TPointerStream *Stream = 
    new TPointerStream((void*)MyAnsiStr.c_str(), MyAnsiStr.length(), true /*ReadOnly*/) ;

This works well with:

RichEdit1->Lines->LoadFromStream(Stream) ;

but it also (unexpectedly) works well with:

RichEdit1->Lines->LoadFromStream(Stream, TEncoding::Unicode) ;

I don't understand this, since the input is not Unicode and the output (to RichEdit) is not Unicode either.


Solution

  • My aim was to create a TCustomMemoryStream descendant that takes MyString as input and uses its internal memory

    Note that the RTL already has a class for that very purpose - TPointerStream.

  • I'd like to understand why TRichEdit is unable to parse the rtf even though it seems to get all data correctly as it displays it in plain text

    TStringStream defaults to using TEncoding::Default when storing the String into its memory. IOW, it actually converts the String to the specified (or in this case, defaulted) encoding, and then stores the converted bytes.

    And when the TRichEdit::Lines::LoadFromStream() method is loading a TStream, it also assumes TEncoding::Default when no TEncoding is specified explicitly, and no BOM is present in the stream data.

    That is why your TStringStream test worked. Your String got converted to an encoding that LoadFromStream() was expecting.

    However, on Windows TEncoding::Default is the same as TEncoding::ANSI. If you store UTF-16 encoded bytes in your TMemoryStream, that will not match what TEncoding::ANSI is expecting, so you would have to be explicit about the actual encoding you want to use.

    Now, when you did specify the encoding explicitly, things still didn't work, because when PlainText is false, TRichEdit uses SF_RTF without SF_UNICODE when issuing the EM_STREAMIN window message to itself. SF_UNICODE is used only when PlainText is true (SF_TEXT instead of SF_RTF). RTF is a 7-bit ASCII format, and SF_RTF can't handle UTF-16 (which is also why your TStringStream test worked: its data had already been converted away from UTF-16).

    When SF_RTF fails, TRichEdit will retry with SF_TEXT and SF_UNICODE instead, which is why you end up with the plain text version of your RTF.
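
To make the mismatch concrete, here is a small sketch in plain standard C++ (not VCL code) that lays out the start of an RTF document as UTF-16LE bytes and compares it with the ASCII signature `{\rtf`. The names ToUtf16LeBytes and LooksLikeRtf are made up for this illustration, and LooksLikeRtf is only a stand-in check. It is not how the RichEdit control actually validates its input.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: serialise UTF-16 code units as little-endian bytes,
// which is the in-memory layout of a wide string on Windows.
std::vector<std::uint8_t> ToUtf16LeBytes(const std::u16string &text)
{
    std::vector<std::uint8_t> bytes;
    bytes.reserve(text.size() * 2);
    for (char16_t unit : text) {
        bytes.push_back(static_cast<std::uint8_t>(unit & 0xFF)); // low byte first
        bytes.push_back(static_cast<std::uint8_t>(unit >> 8));   // then high byte
    }
    return bytes;
}

// Hypothetical stand-in for a byte-oriented reader: does the buffer start
// with the 5-byte ASCII signature "{\rtf"?
bool LooksLikeRtf(const std::vector<std::uint8_t> &bytes)
{
    static const char signature[] = "{\\rtf";
    const std::size_t length = sizeof(signature) - 1; // drop the trailing NUL
    if (bytes.size() < length) return false;
    for (std::size_t i = 0; i < length; ++i) {
        if (bytes[i] != static_cast<std::uint8_t>(signature[i])) return false;
    }
    return true;
}

// Usage sketch:
//   ToUtf16LeBytes(u"{\\rtf1") yields 7B 00 5C 00 72 00 74 00 66 00 31 00,
//   so LooksLikeRtf() is false for it, while the same text stored one byte
//   per character passes the check.
```

Every second byte is 0x00, so a reader that expects one byte per character never sees `{\rtf` at the start. That interleaving is presumably also why the text looked "spaced out" when it was interpreted as ANSI.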

    So, in short, you should not use UTF-16 data when using PlainText=false. But, if you really want to use UTF-16 encoded RTF, you will have to implement a custom TConversion descendant, and then assign that class type to the TRichEdit::DefaultConverter property.
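
If the main concern is keeping non-Latin characters while still handing the control 7-bit data, one possible alternative to a custom TConversion is to escape everything outside ASCII with RTF's \uN control word before streaming. The sketch below is plain standard C++ and makes several assumptions: EscapeNonAsciiForRtf is a hypothetical helper name, the input is already valid RTF markup held as UTF-16, and the document uses the default of one fallback character per \uN (\uc1), for which the helper emits '?'. It builds a new narrow string, so it does not avoid the copy you were trying to save.

```cpp
#include <string>

// Hypothetical helper: turn UTF-16 RTF markup into 7-bit ASCII RTF.
// ASCII code units pass through untouched (so existing control words,
// braces and backslashes are preserved); everything else becomes \uN?.
std::string EscapeNonAsciiForRtf(const std::u16string &rtf)
{
    std::string out;
    out.reserve(rtf.size());
    for (char16_t unit : rtf) {
        if (unit < 0x80) {
            out.push_back(static_cast<char>(unit));
            continue;
        }
        // \uN takes a signed 16-bit decimal value, so code units above
        // 0x7FFF are written as negative numbers.
        const int value = (unit > 0x7FFF) ? static_cast<int>(unit) - 0x10000
                                          : static_cast<int>(unit);
        out += "\\u";
        out += std::to_string(value);
        out += '?'; // single fallback character, assuming \uc1
    }
    return out;
}

// Usage sketch:
//   EscapeNonAsciiForRtf(u"{\\rtf1 \u4F60\u597D}")
//   returns "{\\rtf1 \\u20320?\\u22909?}"
```

The result is an ordinary std::string, so it should be usable with the TPointerStream approach from EDIT 2 and a plain LoadFromStream(Stream) call, though I have not verified that combination in C++Builder 12.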