UTF-8 string has too many bytes using SBCL and babel on Windows 64 bits

The UTF-8 string in the example below seems to be encoded with too many bytes.

The input string: "👉TEST📍TEST"

“👉” (U+1F449): A hand pointing right
“T”, “E”, “S”, “T”: Basic Latin letters
“📍” (U+1F4CD): A round pushpin
“T”, “E”, “S”, “T”: Basic Latin letters

This string is stored in a UTF-8 encoded file. When I open the file in a hexadecimal editor, I see the 16 bytes below, as expected. When I paste the string into online tools, I get the same 16 bytes.

f0 9f 91 89 54 45 53 54 f0 9f 93 8d 54 45 53 54
 \_______/   \_______/   \_______/   \_______/
  U+1F449    T  E  S  T   U+1F4CD    T  E  S  T
   “👉”                    “📍”
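
For reference, the expected 4-byte sequences can be reproduced by hand. The following is a minimal sketch of a UTF-8 encoder for a single code point; code-point-to-utf-8 is a hypothetical helper written for illustration and is not part of babel or SBCL.

```lisp
;; Illustration only: CODE-POINT-TO-UTF-8 is a hypothetical helper,
;; not a babel or SBCL function.
(defun code-point-to-utf-8 (code)
  "Return the list of UTF-8 bytes that encode the Unicode code point CODE."
  (cond ((< code #x80)                       ; 1 byte:  0xxxxxxx
         (list code))
        ((< code #x800)                      ; 2 bytes: 110xxxxx 10xxxxxx
         (list (logior #xC0 (ldb (byte 5 6) code))
               (logior #x80 (ldb (byte 6 0) code))))
        ((< code #x10000)                    ; 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
         (list (logior #xE0 (ldb (byte 4 12) code))
               (logior #x80 (ldb (byte 6 6) code))
               (logior #x80 (ldb (byte 6 0) code))))
        (t                                   ; 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
         (list (logior #xF0 (ldb (byte 3 18) code))
               (logior #x80 (ldb (byte 6 12) code))
               (logior #x80 (ldb (byte 6 6) code))
               (logior #x80 (ldb (byte 6 0) code))))))

(format t "~{~2,'0X~^ ~}~%" (code-point-to-utf-8 #x1F449)) ; prints F0 9F 91 89
(format t "~{~2,'0X~^ ~}~%" (code-point-to-utf-8 #x1F4CD)) ; prints F0 9F 93 8D
```

Any code point at or above U+10000 takes the 4-byte branch, which is why each emoji should contribute 4 bytes and the whole string 16.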

However, the function babel:string-to-octets returns a different result. I get 20 bytes:

(defun print-hex (octets)
  (dotimes (offset (length octets))
    (let ((byte (aref octets offset)))
      (format t "~2,'0x " byte)))
  (format t "(~A bytes)~%" (length octets)))

(let ((string "👉TEST📍TEST"))
  (format t "TEST STRING [~A]~%" string)
  (print-hex (babel:string-to-octets string))
  (print-hex (babel:string-to-octets string :encoding :UTF-8)))

Output:

TEST STRING [👉TEST📍TEST]
ED A0 BD ED B1 89 54 45 53 54 ED A0 BD ED B3 8D 54 45 53 54 (20 bytes)
ED A0 BD ED B1 89 54 45 53 54 ED A0 BD ED B3 8D 54 45 53 54 (20 bytes)

If we analyze this further:

ED A0 BD ED B1 89 54 45 53 54 ED A0 BD ED B3 8D 54 45 53 54
 \_____________/   \_______/   \_____________/   \_______/
       ???         T  E  S  T       ???          T  E  S  T 
       ^^^                          ^^^
UTF-16 surrogate pair?       UTF-16 surrogate pair?
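
The surrogate-pair guess can be checked with a little bit arithmetic. The sketch below decodes each 3-byte group as if it were an ordinary 3-byte UTF-8 sequence and then combines the two results with the standard UTF-16 surrogate formula. The helper names decode-3-byte and combine-surrogates are hypothetical and exist only for this illustration.

```lisp
;; Illustration only: DECODE-3-BYTE and COMBINE-SURROGATES are hypothetical
;; helpers, not babel or SBCL functions.
(defun decode-3-byte (b0 b1 b2)
  "Decode a three-byte sequence of the form 1110xxxx 10xxxxxx 10xxxxxx."
  (logior (ash (ldb (byte 4 0) b0) 12)
          (ash (ldb (byte 6 0) b1) 6)
          (ldb (byte 6 0) b2)))

(defun combine-surrogates (high low)
  "Combine a UTF-16 high/low surrogate pair into one Unicode code point."
  (+ #x10000
     (ash (- high #xD800) 10)
     (- low #xDC00)))

(let ((high (decode-3-byte #xED #xA0 #xBD))
      (low  (decode-3-byte #xED #xB1 #x89)))
  (format t "high = U+~4,'0X, low = U+~4,'0X~%" high low)
  (format t "combined = U+~X~%" (combine-surrogates high low)))
;; prints:
;; high = U+D83D, low = U+DC49
;; combined = U+1F449
```

So the six bytes are the two halves of a UTF-16 surrogate pair, each encoded separately as a 3-byte sequence (the layout used by CESU-8). Well-formed UTF-8 does not allow code points in the surrogate range U+D800..U+DFFF, which is consistent with the decoding error shown further down.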

How do I get the 16 bytes from the input string?

Another interesting behavior that highlights the same issue: converting the string to octets and then back to a string signals an encoding error on the first character.

(let ((string "👉TEST📍TEST"))
  (babel:octets-to-string (babel:string-to-octets string)))

debugger invoked on a BABEL-ENCODINGS:CHARACTER-OUT-OF-RANGE in thread
#<THREAD "main thread" RUNNING {100F080003}>:
  Illegal :UTF-8 character starting at position 0.

Type HELP for debugger help, or (SB-EXT:EXIT) to exit from SBCL.

Edit: the issue seems to be specific to SBCL on Windows; the program runs correctly on Debian Linux.

Solution

I'm pretty sure that this is a problem with the SBCL repl itself, and possibly a problem with the way that you are introducing strings into your code.

As far as the repl is concerned, the SBCL repl is not really actively developed; most lispers are probably using Slime or something similar for repl development. This is a much better experience than working with the SBCL repl. I couldn't get the posted code to misbehave in a Slime repl.

I was able to reproduce the problem with an SBCL repl. On my Windows machine, it seems that pasting the posted string literal into an SBCL repl window resulted in a string that still carries UTF-16 encoding: each emoji arrives as two surrogate characters instead of one character. This is where I suspect there is some issue with the SBCL repl. Calling babel:string-to-octets on the pasted string yields the wrong result, as OP noted. SBCL has its own sb-ext:string-to-octets procedure, and calling that on the pasted string drops into the debugger with an SB-IMPL::OCTETS-ENCODING-ERROR. This makes me think that the problem is somewhere on the SBCL side.

As a workaround, I was able to round-trip the pasted string through a UTF-16 encoding using babel:

;; Calling on a pasted string literal:
* (print-hex (babel:string-to-octets "👉TEST📍TEST"))
ED A0 BD ED B1 89 54 45 53 54 ED A0 BD ED B3 8D 54 45 53 54 (20 bytes)
NIL

;; Round-tripping the pasted string literal:
* (print-hex (babel:string-to-octets
              (babel:octets-to-string
               (babel:string-to-octets "👉TEST📍TEST" :encoding :utf-16)
               :encoding :utf-16)))
F0 9F 91 89 54 45 53 54 F0 9F 93 8D 54 45 53 54 (16 bytes)
NIL

* (let* ((s "👉TEST📍TEST")
         (s-reencoded (babel:octets-to-string
                       (babel:string-to-octets s :encoding :utf-16)
                       :encoding :utf-16)))
    (format t "TEST STRING [~A]~%" s)
    (print-hex (babel:string-to-octets s-reencoded)))
TEST STRING [👉TEST📍TEST]
F0 9F 91 89 54 45 53 54 F0 9F 93 8D 54 45 53 54 (16 bytes)
NIL
*

Note that I was unable to make the same round-tripping work by using SBCL's sb-ext:string-to-octets and sb-ext:octets-to-string procedures.
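
Another option that avoids the encoding round trip altogether is to merge the surrogate pairs directly in the string. The following is a minimal sketch, not tested on Windows; merge-surrogates is a hypothetical helper of my own naming, and it assumes an implementation such as SBCL where a string can hold surrogate code points as individual characters.

```lisp
;; Sketch only: MERGE-SURROGATES is a hypothetical helper, not part of babel
;; or SBCL. It assumes surrogate code points can be stored as characters,
;; which holds on SBCL but is not guaranteed by the standard.
(defun merge-surrogates (string)
  "Return a copy of STRING with each high/low surrogate pair replaced by
the single character it represents. Unpaired surrogates are kept as-is."
  (with-output-to-string (out)
    (let ((i 0)
          (n (length string)))
      (loop while (< i n)
            do (let ((hi (char-code (char string i))))
                 (if (and (<= #xD800 hi #xDBFF)
                          (< (1+ i) n)
                          (<= #xDC00 (char-code (char string (1+ i))) #xDFFF))
                     ;; Valid pair: emit one character and skip both halves.
                     (let ((lo (char-code (char string (1+ i)))))
                       (write-char (code-char (+ #x10000
                                                 (ash (- hi #xD800) 10)
                                                 (- lo #xDC00)))
                                   out)
                       (incf i 2))
                     ;; Anything else is copied through unchanged.
                     (progn
                       (write-char (char string i) out)
                       (incf i))))))))

;; Build a string the way the pasted literal appears to arrive: two surrogate
;; characters followed by ordinary text.
(let* ((broken (coerce (list (code-char #xD83D) (code-char #xDC49) #\T) 'string))
       (fixed  (merge-surrogates broken)))
  (format t "~D characters before, ~D after, first is U+~X~%"
          (length broken) (length fixed) (char-code (char fixed 0))))
```

If the pasted string really contains surrogate pairs, passing the result of merge-surrogates to babel:string-to-octets should give the expected 16 bytes, though I have only reasoned about this and not verified it in the Windows repl.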

The OP has said: "This string is stored in a UTF-8 encoded file." The significance of this is unclear. Was the posted code saved in a file and loaded into a repl? I saved the posted code in a file using Emacs and Slime, using Windows Notepad with UTF-8 encoding, and using Windows Notepad with UTF-16 encoding. Every time I loaded this code from any of these files into either the SBCL repl or the Slime repl it worked as expected. This leads me to believe that the problem may be an inconvenience for playing in the repl, but not an issue for real programs.