Strange behaviour of string:length function


Why does the presence of the character 'â' cause this to fail?

> Bin = <<"â Hello">>.
> string:length(Bin).

** exception error: bad argument: <<"â Hello">>
     in function  string:length_1/2 (string.erl, line 557)

Whereas if the binary is converted to a list, it works fine.

> Str = binary_to_list(Bin).
> string:length(Str).
  7

Solution

  • The argument to string:length() can be a list of integers (Unicode code points, between 0 and 1114111) or a binary whose bytes form valid UTF-8 characters.

    There are many ways to represent the character â (small letter a with circumflex), and two of them are:

    1. The Latin-1 integer code 226.

    2. The UTF-8 representation: 195, 162 (or in hex: C3 A2)

       5> <<195, 162>>.
       <<"â"/utf8>>
      

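    When you typed in your binary, the character â was stored as the single byte 226, which is its Latin-1 code (a string inside a binary literal stores one byte per character unless you add /utf8). In UTF-8, a byte of 226 announces a three-byte character and must be followed by two continuation bytes in the range 128 to 191. Here it is followed by 32 (a space), so the binary is not valid UTF-8 and Erlang gave you a bad argument error.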
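
    As a rough sketch of how to check and repair such a binary, the expressions below can be pasted into the shell. The names Latin1, Utf8 and IsUtf8 are made up for illustration. The check relies on unicode:characters_to_binary/3 returning a plain binary only when its input is valid in the stated input encoding.

    ```erlang
    %% The binary from the question: â stored as the single byte 226.
    Latin1 = <<226, " Hello">>.

    %% Hypothetical helper: true only if B decodes cleanly as UTF-8.
    %% On invalid input the call returns an error/incomplete tuple, not a binary.
    IsUtf8 = fun(B) -> is_binary(unicode:characters_to_binary(B, utf8, utf8)) end.
    false = IsUtf8(Latin1).

    %% Option 1: re-encode the Latin-1 bytes as UTF-8.
    Utf8 = unicode:characters_to_binary(Latin1, latin1, utf8).
    true = IsUtf8(Utf8).
    7 = string:length(Utf8).

    %% Option 2: ask for UTF-8 in the literal itself. 226/utf8 encodes the
    %% code point 226 as the two bytes 195,162; writing <<"â Hello"/utf8>>
    %% should give the same result when the input is read as Unicode.
    Utf8 = <<226/utf8, " Hello">>.
    ```

    Each match (false = ..., 7 = ...) acts as an assertion: if the value differs, the shell reports a badmatch instead of printing the result.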

    Next, why does converting the binary to a list work? In Erlang, double quotes are a shortcut for creating a list of integers:

    6> "abc" =:= [97,98,99].  % =:= means "exactly equal"
    true
    

    Whenever you see double quotes in Erlang, you should be thinking: "This is a list." The one exception to that rule is when you use double quotes inside a binary: instead of creating a list, you create a comma-separated series of integers, one byte per character:

    8> <<"abc", 0>>.
    <<97,98,99,0>>
    

    Adding a 0 is a trick that forces the shell to show you what you really have. Or, you can tell Erlang to stop displaying integers as double-quoted strings and just show you the raw values:

    18> shell:strings(false).
    true
    
    19> "abc".
    [97,98,99]
    
    20> <<226, "Hello">>.
    <<226,72,101,108,108,111>>
    

    When you convert a binary to a list with binary_to_list(), you get a list containing the same integers, one per byte (binaries can only contain integers between 0 and 255). Any list of integers between 0 and 1114111 is a valid argument for string:length(), because each integer is read as a Unicode code point. In the list, 226 is the code point of â, so the seven integers count as seven characters.
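
    The difference between copying bytes and decoding characters can be sketched in the shell as follows. The variable names are illustrative only. unicode:characters_to_list/1 is the standard-library call that decodes a UTF-8 binary into code points, which binary_to_list/1 does not do.

    ```erlang
    %% binary_to_list/1 copies bytes: one list element per byte.
    FromLatin1 = binary_to_list(<<226, " Hello">>).
    [226,32,72,101,108,108,111] = FromLatin1.
    7 = string:length(FromLatin1).

    %% The same call on UTF-8 bytes still yields one element per byte...
    [195,162] = binary_to_list(<<195,162>>).

    %% ...whereas unicode:characters_to_list/1 decodes UTF-8 into code points.
    [226] = unicode:characters_to_list(<<195,162>>).
    ```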

    Finally, note that string:length() doesn't merely return the number of bytes in a binary:

    23> string:length(<<195,162,97,98,99>>).
    4  
    

    string:length() recognizes that the first two bytes, i.e. 195, 162, are the UTF-8 encoding of small letter a with circumflex, and therefore it counts those two bytes as one character. On the other hand, if you convert to a list first, each byte becomes a separate code point (195 is Ã and 162 is ¢), so string:length() counts every integer in the list:

    24> string:length(binary_to_list(<<195,162,97,98,99>>)).
    5
    

    ...which, for this binary, is the same answer you get with byte_size(Binary).
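
    A related detail, shown here only as a small sketch: the string module documentation describes string:length() as counting grapheme clusters, so its result can differ from both length/1 and byte_size/1. The variable names Ring and RingBin are made up for the example.

    ```erlang
    %% "e" followed by U+030A COMBINING RING ABOVE (778): one visible character.
    Ring = [101, 778].
    2 = length(Ring).            % two code points in the list
    1 = string:length(Ring).     % one grapheme cluster

    %% The same text as a UTF-8 binary: 101 takes 1 byte, 778 takes 2 bytes.
    RingBin = unicode:characters_to_binary(Ring).
    3 = byte_size(RingBin).
    1 = string:length(RingBin).
    ```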