
Delphi & Indy & utf8


I have a problem accessing websites that use the UTF-8 charset. For example, when I try to access this URL:

Click for example

all of the UTF-8 characters are decoded incorrectly. This is my access routine:

var
  Web     : TIdHTTP;
  Sito    : String;
  hIOHand : TIdSSLIOHandlerSocketOpenSSL;

begin
  Url := TIdURI.URLEncode(Url);

  // Create before the try so the finally never frees an uninitialized reference
  Web := TIdHTTP.Create(nil);
  try
    // Owned by Web, so it is freed together with it
    hIOHand := TIdSSLIOHandlerSocketOpenSSL.Create(Web);
    hIOHand.DefStringEncoding := IndyTextEncoding_UTF8;
    hIOHand.SSLOptions.SSLVersions := [sslvTLSv1,sslvTLSv1_1,sslvTLSv1_2,sslvSSLv2,sslvSSLv3,sslvSSLv23];
    Web.IOHandler := hIOHand;
    Web.Request.CharSet := 'utf-8';

    Web.Request.UserAgent := INET_USERAGENT;       //Custom user agent string
    Web.RedirectMaximum := INET_REDIRECT_MAX;      //Maximum redirects
    Web.HandleRedirects := INET_REDIRECT_MAX <> 0; //Handle redirects
    Web.ReadTimeOut := INET_TIMEOUT_SECS * 1000;   //Read timeout msec
    try
      Sito := Web.Get(Url);
      Web.Disconnect;
    except
      on E : Exception do
        Sito := 'ERR: ' + Url + #32 + E.Message;
    end;
  finally
    Web.Free;
  end;

I have tried every solution I could find, but the Sito variable always contains wrong characters. For example, the correct value of "name" is

"name": "Aire d'adhésion du Parc national du Mercantour",

but after the Get call I have

"name": "Aire d'adhÃ©sion du Parc national du Mercantour",

Do you have any idea where my error is? Thank you all!


Solution

  • In Delphi 2009+, which includes XE6, string is a UTF-16 encoded UnicodeString.

    You are using the overloaded version of TIdHTTP.Get() that returns a string. It decodes the received text to UTF-16 using whatever charset is reported by the response. If the text is not decoding properly, it likely means the response is reporting an incorrect charset, or none at all.

    The URL in question is, in fact, sending a response Content-Type header that is set to application/json without specifying a charset at all. The default charset for application/json is UTF-8, but Indy does not know that, so it ends up using its own internal default instead, which is not UTF-8. That is why the text is not decoding properly when non-ASCII characters are present.
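The effect described above can be reproduced without any network access. The following is only a minimal sketch: the program name is made up, and ISO-8859-1 (code page 28591) is an assumption standing in for "some 8-bit charset that is not UTF-8", not a statement about which default Indy actually picks. It encodes a short string as UTF-8 and then decodes the same bytes with both charsets.

```delphi
program MojibakeDemo; // hypothetical name, for illustration only

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

var
  Original, Wrong: string;
  Raw: TBytes;
  Latin1: TEncoding;
begin
  // 'adhésion', with the accented letter written as an explicit UTF-16 code unit
  Original := 'adh'#$00E9'sion';

  // The bytes a UTF-8 server would send: the accented letter becomes $C3 $A9
  Raw := TEncoding.UTF8.GetBytes(Original);
  Assert(Length(Raw) = 9);

  // Decoding with the right charset round-trips the text
  Assert(TEncoding.UTF8.GetString(Raw) = Original);

  // Decoding with an 8-bit charset (assumed: ISO-8859-1) turns each byte
  // into its own character, so $C3 $A9 becomes two characters
  Latin1 := TEncoding.GetEncoding(28591);
  try
    Wrong := Latin1.GetString(Raw);
  finally
    Latin1.Free; // encodings obtained via GetEncoding are owned by the caller
  end;
  Assert(Wrong = 'adh'#$00C3#$00A9'sion');

  Writeln('Original length:    ', Length(Original));
  Writeln('Mis-decoded length: ', Length(Wrong));
end.
```

The bytes themselves are fine in both cases. Only the charset chosen for decoding differs, which is why every workaround comes down to telling Indy (or your own code) to use UTF-8.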

    If you KNOW the charset will always be UTF-8, you have a few workarounds to choose from:

    • you can set Indy's default charset to UTF-8 by setting the global GIdDefaultTextEncoding variable in the IdGlobal unit:

      GIdDefaultTextEncoding := encUTF8;
      
    • you can use the TIdHTTP.OnHeadersAvailable event to change the TIdHTTP.Response.Charset property to 'utf-8' if it is blank or incorrect:

      Web.OnHeadersAvailable := CheckResponseCharset;
      
      ...
      
      procedure TMyClass.CheckResponseCharset(Sender: TObject; AHeaders: TIdHeaderList; var VContinue: Boolean);
      var
        Response: TIdHTTPResponse;
      begin
        Response := TIdHTTP(Sender).Response;
        if IsHeaderMediaType(Response.ContentType, 'application/json') and (Response.Charset = '') then
          Response.Charset := 'utf-8';
        VContinue := True;
      end;
      
    • you can use the other overloaded version of TIdHTTP.Get() that fills an output TStream instead of returning a string. Using a TMemoryStream or TStringStream, you can decode the raw bytes yourself using UTF-8:

      MStrm := TMemoryStream.Create;
      try
        Web.Get(Url, MStrm);
        MStrm.Position := 0;
        Sito := ReadStringFromStream(MStrm, -1, IndyTextEncoding_UTF8); // -1 = read to the end of the stream
      finally
        MStrm.Free;
      end;

      Or, using a TStringStream instead:

      SStrm := TStringStream.Create('', TEncoding.UTF8);
      try
        Web.Get(Url, SStrm);
        Sito := SStrm.DataString;
      finally
        SStrm.Free;
      end;
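If the server might sometimes report a charset after all, the stream approach can be combined with the charset check, so that UTF-8 is only a fallback. This is a rough sketch, not a definitive implementation. The unit name Utf8Fetch and the function GetTextPreferUtf8 are hypothetical. It assumes that CharsetToEncoding from Indy's IdGlobalProtocols unit is available in your Indy version, and that Web has already been configured as in the question.

```delphi
unit Utf8Fetch; // hypothetical unit name

interface

uses
  IdHTTP;

// Hypothetical helper: downloads Url and decodes the body with the charset
// the response reports, falling back to UTF-8 when none is reported.
function GetTextPreferUtf8(Web: TIdHTTP; const Url: string): string;

implementation

uses
  System.Classes, IdGlobal, IdGlobalProtocols;

function GetTextPreferUtf8(Web: TIdHTTP; const Url: string): string;
var
  MStrm: TMemoryStream;
  Enc: IIdTextEncoding;
begin
  MStrm := TMemoryStream.Create;
  try
    // Stream overload of Get(): Indy stores the raw bytes and decodes nothing
    Web.Get(Url, MStrm);
    MStrm.Position := 0;

    if Web.Response.CharSet <> '' then
      // Assumption: CharsetToEncoding maps a charset name to an Indy encoding
      Enc := CharsetToEncoding(Web.Response.CharSet)
    else
      Enc := IndyTextEncoding_UTF8;

    // -1 reads from the current position to the end of the stream
    Result := ReadStringFromStream(MStrm, -1, Enc);
  finally
    MStrm.Free;
  end;
end;

end.
```

With this in place, the call in the question would become Sito := GetTextPreferUtf8(Web, Url); and the existing try/except around it can stay as it is.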