Tags: delphi, unicode, encoding, ascii, delphi-7

Convert Unicode to ASCII


I have a text file which can come in different encodings (ASCII, UTF-8, UTF-16, UTF-32). The best part is that it contains only digits, for example:

192848292732

My question is: will a function like the one below be able to display all the data correctly? If not, why? (I have loaded the file as a string into the string variable container.)

function output(container: AnsiString): AnsiString;
var
  i: Integer;
begin
  Result := '';
  // Copy every byte of the input except the zero bytes.
  for i := 1 to Length(container) do
    if (Ord(container[i]) <> 0) then
      Result := Result + container[i];
end;

My logic is that if the encoding is anything other than ASCII or UTF-8, the extra bytes are all 0. Is that correct?

It passes all the tests just fine.


Solution

  • The ASCII character set uses codes 0-127. In Unicode, these characters map to code points with the same numeric value. So the question comes down to how each of the encodings represents code points 0-127.

    • UTF-8 encodes code points 0-127 in a single byte containing the code point value. In other words, if the payload is ASCII, then there is no difference between ASCII and UTF-8 encoding.
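    • UTF-16 encodes code points 0-127 in two bytes, one of which is 0, and the other of which is the ASCII code. Which of the two comes first depends on the byte order (little-endian or big-endian).
    • UTF-32 encodes code points 0-127 in four bytes, three of which are 0, and the remaining byte is the ASCII code. Again, its position depends on the byte order.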
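
    To make the byte layouts above concrete, here is a minimal sketch that writes out the text '19' by hand in each encoding (without a BOM) and runs it through the same zero-byte filter as in the question. The names StripZeroBytes, Utf8Bytes and so on are made up for this illustration, and the sketch assumes a compiler where AnsiString holds raw bytes, as in Delphi 7.

    ```delphi
    program EncodingBytes;
    {$APPTYPE CONSOLE}

    // Same idea as the function in the question: drop every zero byte.
    function StripZeroBytes(const Data: AnsiString): AnsiString;
    var
      i: Integer;
    begin
      Result := '';
      for i := 1 to Length(Data) do
        if Data[i] <> #0 then
          Result := Result + Data[i];
    end;

    const
      // The text '19' ('1' = $31, '9' = $39), hand-encoded, no BOM.
      Utf8Bytes : AnsiString = #$31#$39;                          // same as ASCII
      Utf16LE   : AnsiString = #$31#$00#$39#$00;
      Utf16BE   : AnsiString = #$00#$31#$00#$39;
      Utf32LE   : AnsiString = #$31#$00#$00#$00#$39#$00#$00#$00;
      Utf32BE   : AnsiString = #$00#$00#$00#$31#$00#$00#$00#$39;

    begin
      // Each line prints 19: only the padding zero bytes differ.
      Writeln(StripZeroBytes(Utf8Bytes));
      Writeln(StripZeroBytes(Utf16LE));
      Writeln(StripZeroBytes(Utf16BE));
      Writeln(StripZeroBytes(Utf32LE));
      Writeln(StripZeroBytes(Utf32BE));
    end.
    ```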

    Your proposed algorithm cannot preserve ASCII code 0 (NUL), because it discards every zero byte. But you state that the file contains only digits, so that character is never present.

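    The only other problem that I can see with your proposed code is that it will not recognise a byte order mark (BOM). A BOM may be present at the beginning of the file, and its bytes are not all zero (EF BB BF for UTF-8, FF FE or FE FF for UTF-16), so your function would copy them into the result. I guess you should detect a BOM and skip it.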
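
    One way to detect and skip a BOM could look like the sketch below. HasPrefix and BomLength are hypothetical helper names, not RTL functions, and the sketch again assumes the raw file bytes are held in an AnsiString as in Delphi 7. The UTF-32 checks come first because the UTF-32 little-endian BOM (FF FE 00 00) begins with the UTF-16 little-endian BOM (FF FE); for a file that contains only digits this ordering is unambiguous.

    ```delphi
    program SkipBom;
    {$APPTYPE CONSOLE}

    // True if Data starts with exactly the given byte values.
    function HasPrefix(const Data: AnsiString; const Prefix: array of Byte): Boolean;
    var
      i: Integer;
    begin
      Result := Length(Data) >= High(Prefix) + 1;
      i := 0;
      while Result and (i <= High(Prefix)) do
      begin
        Result := Ord(Data[i + 1]) = Prefix[i];
        Inc(i);
      end;
    end;

    // Number of BOM bytes at the start of Data, or 0 if no BOM is found.
    function BomLength(const Data: AnsiString): Integer;
    begin
      if HasPrefix(Data, [$FF, $FE, $00, $00]) or
         HasPrefix(Data, [$00, $00, $FE, $FF]) then
        Result := 4                       // UTF-32 LE / BE
      else if HasPrefix(Data, [$EF, $BB, $BF]) then
        Result := 3                       // UTF-8
      else if HasPrefix(Data, [$FF, $FE]) or
              HasPrefix(Data, [$FE, $FF]) then
        Result := 2                       // UTF-16 LE / BE
      else
        Result := 0;
    end;

    var
      S: AnsiString;
    begin
      // Build a UTF-8 BOM followed by '1928'; the '?' are placeholders.
      S := '???1928';
      S[1] := AnsiChar($EF);
      S[2] := AnsiChar($BB);
      S[3] := AnsiChar($BF);
      Writeln(BomLength(S));              // prints 3
      Delete(S, 1, BomLength(S));
      Writeln(S);                         // prints 1928
    end.
    ```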

    Having said all of this, your implementation seems odd to me. You seem to state that the file only contains numeric characters. In which case your test could equally well be:

    if container[i] in ['0'..'9'] then
      .........
    

    If you used this code then you would also happen to skip over a BOM, were it present.
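
    Put together, the digit-only test could be sketched as follows. ExtractDigits is just an illustrative name, and the sample is a hand-built UTF-16 little-endian string with a BOM, assuming (as in Delphi 7) that the raw file bytes sit in an AnsiString.

    ```delphi
    program DigitFilter;
    {$APPTYPE CONSOLE}

    // Keep only the bytes that are ASCII digits; zero padding bytes and
    // BOM bytes are both dropped because neither falls in '0'..'9'.
    function ExtractDigits(const Container: AnsiString): AnsiString;
    var
      i: Integer;
    begin
      Result := '';
      for i := 1 to Length(Container) do
        if Container[i] in ['0'..'9'] then
          Result := Result + Container[i];
    end;

    var
      Sample: AnsiString;
    begin
      // UTF-16 LE encoding of '192'; the two '?' are replaced by the BOM FF FE.
      Sample := '??1'#0'9'#0'2'#0;
      Sample[1] := AnsiChar($FF);
      Sample[2] := AnsiChar($FE);
      Writeln(ExtractDigits(Sample));     // prints 192
    end.
    ```

    Note that this relies entirely on the file containing nothing but digits; any other character, such as a line break or a minus sign, would be silently dropped as well.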