
I need help converting a C# string from one character encoding to another?


According to Spolsky I can't call myself a developer, so there is a lot of shame behind this question...

Scenario: From a C# application, I would like to take a string value from a SQL db and use it as the name of a directory. I have a secure (SSL) FTP server on which I want to set the current directory using the string value from the DB.
Problem: Everything is working fine until I hit a string value with a "special" character - I seem unable to encode the directory name correctly to satisfy the FTP server.

The code example below

  • uses "special" character é as an example
  • uses WinSCP as an external application for the ftps comms
  • does not show all the code required to set up the Process "_winscp"
  • sends commands to the WinSCP exe by writing to the process's standard input
  • for simplicity, does not get the info from the DB, but instead simply declares a string (but I did do a .Equals to confirm that the value from the DB is the same as the declared string)
  • makes three attempts to set the current directory on the FTP server using different string encodings - all of which fail
  • makes an attempt to set the directory using a string that was created from a hand-crafted byte array - which works

Process _winscp = new Process();
byte[] buffer;

string nameFromString = "Sinéad O'Connor";
_winscp.StandardInput.WriteLine("cd \"" + nameFromString + "\"");

buffer = Encoding.UTF8.GetBytes(nameFromString);
_winscp.StandardInput.WriteLine("cd \"" + Encoding.UTF8.GetString(buffer) + "\"");

buffer = Encoding.ASCII.GetBytes(nameFromString);
_winscp.StandardInput.WriteLine("cd \"" + Encoding.ASCII.GetString(buffer) + "\"");

byte[] nameFromBytes = new byte[] { 83, 105, 110, 130, 97, 100, 32, 79, 39, 67, 111, 110, 110, 111, 114 };
_winscp.StandardInput.WriteLine("cd \"" + Encoding.Default.GetString(nameFromBytes) + "\"");

The UTF8 encoding changes é to 101 (decimal) but the FTP server doesn't like it.

The ASCII encoding changes é to 63 (decimal) but the FTP server doesn't like it.

When I represent é as value 130 (decimal) the FTP server is happy, except I can't find a method that will do this for me (I had to manually construct the string from explicit bytes).

Anyone know what I should do to my string to encode the é as 130 and make the FTP server happy and finally elevate me to level 1 developer by explaining the only single thing a developer should understand?


Solution

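  • 130 isn't ASCII: ASCII is only 7 bits (see the Encoding.ASCII documentation), so Encoding.ASCII replaces the "é" with a plain "?" (63) because it has nothing better to do. UTF-8 does not turn é into 101; it encodes the character as two bytes (decimal 195 and 169) and preserves the code point. Decoding those bytes gives back the original string, which is why the first two attempts send exactly the same text.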

    Use an explicit code page, such as Latin (CP 1252); it needs to match whatever the other side expects. As the output below shows, CP 1252 produces no "130", so it is not the encoding you need here :-) But the same principle applies: use an encoding for a specific code page.

    Edit: As Hans Passant explained in a comment, the code page to use here is MS-DOS (CP 437), which encodes é as the desired 130.

    // LINQPad -- Encoding is System.Text.Encoding
    var enc = Encoding.GetEncoding(1252);
    string.Join(" ", enc.GetBytes("Sinéad O'Connor")).Dump();
    // -> 83 105 110 233 97 100 32 79 39 67 111 110 110 111 114
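For comparison, here is a small sketch that prints the bytes each encoding mentioned above produces for é. The class and method names (EncodingComparison, BytesOf) are made up for illustration. The RegisterProvider call is only needed on .NET Core / .NET 5+, where code-page encodings such as 437 and 1252 are not available by default; on the classic .NET Framework that line can be dropped.

```csharp
using System;
using System.Text;

public static class EncodingComparison
{
    // Hypothetical helper: encode a string with the given code page
    // and return the raw bytes.
    public static byte[] BytesOf(string text, int codePage)
    {
        // Needed on .NET Core / .NET 5+ so that code pages like 437 and 1252 resolve.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(codePage).GetBytes(text);
    }

    public static void Main()
    {
        string e = "\u00E9"; // é, written as an escape so the source file encoding does not matter

        Console.WriteLine("ASCII:   " + string.Join(" ", Encoding.ASCII.GetBytes(e))); // 63  ("?")
        Console.WriteLine("UTF-8:   " + string.Join(" ", Encoding.UTF8.GetBytes(e)));  // 195 169
        Console.WriteLine("CP 1252: " + string.Join(" ", BytesOf(e, 1252)));           // 233
        Console.WriteLine("CP 437:  " + string.Join(" ", BytesOf(e, 437)));            // 130

        // The full name from the question, encoded with CP 437:
        Console.WriteLine(string.Join(" ", BytesOf("Sin\u00E9ad O'Connor", 437)));
        // -> 83 105 110 130 97 100 32 79 39 67 111 110 110 111 114
    }
}
```

Only the CP 437 line yields the 130 from the hand-crafted byte array in the question.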
    

    See: http://msdn.microsoft.com/en-us/goglobal/bb688114 for more.

    Happy coding.

    Btw. good selection in artists -- if it was intentional :p
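One possible way to apply the code-page advice to the commands themselves is to write them through a StreamWriter that uses CP 437, instead of converting the string back and forth. This is only a sketch under assumptions: it has not been tried against WinSCP, the names (Cp437Commands, Encode) are hypothetical, and a MemoryStream stands in for the process's standard input so the resulting bytes can be inspected. In the real application the stream would be _winscp.StandardInput.BaseStream.

```csharp
using System;
using System.IO;
using System.Text;

public static class Cp437Commands
{
    // Hypothetical helper: write one command line to a stream using CP 437
    // and return the bytes that were actually written.
    public static byte[] Encode(string command)
    {
        // Needed on .NET Core / .NET 5+; can be dropped on the classic .NET Framework.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding cp437 = Encoding.GetEncoding(437);

        // Stand-in for _winscp.StandardInput.BaseStream (assumption for this demo).
        using (var stream = new MemoryStream())
        {
            using (var writer = new StreamWriter(stream, cp437))
            {
                writer.WriteLine(command);
                writer.Flush();
            }
            return stream.ToArray();
        }
    }

    public static void Main()
    {
        string name = "Sin\u00E9ad O'Connor";
        byte[] sent = Encode("cd \"" + name + "\"");

        // The é (the eighth byte, index 7) goes out as 130, not as "?" or as two UTF-8 bytes.
        Console.WriteLine(string.Join(" ", sent));
    }
}
```

The point of the design is that the string stays an ordinary .NET string; the encoding is chosen once, at the place where characters become bytes.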