
What occurs when a string is converted to a byte array


I think this is a newbie-type question, but I have not quite understood this.

I can find many posts on how to convert a string to a byte array in various languages.

What I do not understand is what is happening on a character-by-character basis. I understand that each character displayed on the screen is represented by a number, such as its ASCII code. (Can we stick with ASCII for the moment so I get this conceptually :-))

Does this mean that when I want to represent a character or a string (which is a list of characters), the following occurs:

Convert character to ASCII value > represent ASCII value as binary?

I have seen code that creates byte arrays by defining the byte array as 1/2 the length of the input string, but surely a byte array should be the same length as the string?

So I am a little confused. Basically, I am trying to store a string value in a byte array in ColdFusion, which does not appear to have an explicit string-to-byte-array function.

However, I can get to the underlying Java, but I need to know what's happening at the theoretical level.

Thanks in advance, and please tell me nicely if you think I am barking mad!!

Gus


Solution

  • In Java, strings are stored as an array of 16-bit char values. Each Unicode character in the string is stored as one or (rarely) two char values in the array.
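
    As a small illustration (the variable names here are just for the example): a character inside the Basic Multilingual Plane takes one char value, while a character outside it takes two (a surrogate pair).

    String ascii = "A";            // one character, stored as one char value
    String emoji = "\uD83D\uDE00"; // U+1F600, one character, stored as two char values
    System.out.println(ascii.length()); // 1
    System.out.println(emoji.length()); // 2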

    If you want to store some string data in a byte array, you will need to be able to convert the string's Unicode characters into a sequence of bytes. This process is called encoding and there are several ways to do it, each with different rules and results. If two pieces of code want to share string data using byte arrays, they need to agree on which encoding is being used.
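
    To make "different rules and results" concrete, here is a small sketch (assuming Java 7 or later, so that java.nio.charset.StandardCharsets is available) that encodes the same string with two different encodings and gets two different byte sequences:

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    String s = "Hi";
    // UTF-8: one byte per character for ASCII text
    System.out.println(Arrays.toString(s.getBytes(StandardCharsets.UTF_8)));    // [72, 105]
    // UTF-16 (big-endian): two bytes per character
    System.out.println(Arrays.toString(s.getBytes(StandardCharsets.UTF_16BE))); // [0, 72, 0, 105]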

    For example, suppose we have a string s that we want to encode using the UTF-8 encoding. UTF-8 has the convenient property that if you use it to encode a string that contains only ASCII characters, every character in the input gets converted to a single byte with that character's ASCII value. We might convert our Java string to a Java byte array as follows:

    byte[] bytes = s.getBytes("UTF-8");
    

    The byte array bytes now contains the string data from s, encoded into bytes using the UTF-8 encoding.
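
    That also bears on the length question from the original post: for a string containing only ASCII characters, UTF-8 produces exactly one byte per character, so the byte array is the same length as the string. A small sketch (illustrative names, exception handling omitted as in the snippet above):

    String name = "Gus";
    byte[] nameBytes = name.getBytes("UTF-8");
    // nameBytes.length == 3, the same as name.length()
    // nameBytes[0] == 71 ('G'), nameBytes[1] == 117 ('u'), nameBytes[2] == 115 ('s')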

    Now, we store or transmit the bytes somewhere, and the code on the other end wants to decode the bytes back into a Java String. It will do something like the following:

    String t = new String(bytes, "UTF-8");
    

    Assuming nothing went wrong, the string t now contains the same string data as the original string s.

    Note that both pieces of code had to agree on what encoding was being used. If they disagreed, the resulting string might end up containing garbage, or might even fail to decode at all.
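
    As a rough sketch of that failure mode (the charset names and variable names below are just for illustration): the character é (U+00E9) encodes to two bytes in UTF-8, and decoding those same bytes with ISO-8859-1 instead yields two unrelated characters.

    byte[] utf8Bytes = "\u00E9".getBytes("UTF-8");      // 0xC3, 0xA9
    String wrong = new String(utf8Bytes, "ISO-8859-1"); // "Ã©" -- garbage
    String right = new String(utf8Bytes, "UTF-8");      // "é"  -- round-trips correctly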