
unpacking binary file via octets->string->unpack fails: signed int `#(243 0)` is illegal UTF8


I am parsing a binary file (NIfTI) containing a mix of chars, floats, ints, and shorts, using the PDL::IO::Nifti CPAN module as a reference.

I am having some luck converting sequences of octets to a string so they can be passed to cl-pack:unpack. This is convoluted, but convenient for porting with the Perl module as a reference.

This strategy fails when reading #(243 0) as binary:

(setf my-problem (make-array 2
                             :element-type '(unsigned-byte 8)
                             :initial-contents #(243 0)))
(babel:octets-to-string my-problem)

Illegal :UTF-8 character starting at position 0

and, when trying to read the file as char*:

the octet sequence #(243 0 1 0) cannot be decoded.

I'm hoping this is a simple encoding issue I haven't figured out. Going in the reverse direction (packing 243 and getting octets) gives a vector of length 3 where I expect length 2.

(babel:string-to-octets (cl-pack:pack "s" 243))
; yields #(195 179 0); expected #(243 0)

Full context

;; Reading works up to position 40, where we expect 8 signed shorts (16 bytes).
;; The 4th short has the value 243, but its octets cannot be decoded.
(setq fid-bin (open "test.nii" :direction :input :element-type '(unsigned-byte 8)))
(file-position fid-bin 40)
(setf seq (make-array (* 2 8) :element-type '(unsigned-byte 8)))
(read-sequence seq fid-bin) 
; seq: #(3 0 0 1 44 1 243 0 1 0 1 0 1 0 1 0)

(babel:octets-to-string seq) ; Illegal :UTF-8 character starting at position 6.
(sb-ext:octets-to-string seq) ; Illegal ....

;; first 3 are as expected
(cl-pack:unpack "s3" (babel:octets-to-string (subseq seq 0 6)))
; 3 256 300

(setf my-problem (subseq seq 6 8)) ; #(243 0)
(babel:octets-to-string my-problem)       ; Illegal :UTF-8 character starting at position 0.

;; checking the reverse direction
;; 243 gets represented as 3 bytes!?
(babel:string-to-octets (cl-pack:pack "s3" 3 256 300))     ; #(3 0 0 1 44 1)
(babel:string-to-octets (cl-pack:pack "s4" 3 256 300 243)) ; #(3 0 0 1 44 1 195 179 0)


(setq fid-str (open "test.nii" :direction :input))
(setf char-seq (make-array (* 2 8) :initial-element nil :element-type 'char*))
(file-position fid-str 40)
(read-sequence char-seq fid-str)
;; :UTF-8 stream decoding error on #<SB-SYS:FD-STREAM ....
;; the octet sequence #(243 0 1 0) cannot be decoded.


The Perl equivalent

open my $f, "test.nii";
seek $f, 46, 0;
read $f,my $b, 2;
print(unpack "s", $b); # 243


Solution

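  • The problem is indeed encoding-related. cl-pack evidently represents each packed byte as the character with that code, so the short 243 becomes "ó" (U+00F3) followed by a NUL: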

    CL-USER> (cl-pack:pack "s" 243)
    "ó\0"
    

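    That is the same string you get by decoding the octets as ISO-8859-1 (Latin-1) instead of babel's default UTF-8, in which a lone byte 243 is invalid and "ó" encodes as the two bytes 195 179: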

    (babel:octets-to-string my-problem :encoding :iso-8859-1)
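
To illustrate why the Latin-1 mapping works, here is a minimal sketch in plain Common Lisp with no library dependencies. The helper names octets-to-latin1, latin1-to-octets and decode-s16le are hypothetical and not part of babel or cl-pack. It assumes an implementation such as SBCL, where code-char maps 0..255 to the corresponding Latin-1 characters, and it assumes the file's shorts are little-endian, as the octets #(243 0) for the value 243 suggest.

```lisp
;; Sketch only: these helpers are hypothetical names, not babel or cl-pack API.

(defun octets-to-latin1 (octets)
  "Map each octet to the character with the same code (a Latin-1 style decoding)."
  (map 'string #'code-char octets))

(defun latin1-to-octets (string)
  "Inverse mapping: one octet per character. Assumes every char code is below 256."
  (map '(vector (unsigned-byte 8)) #'char-code string))

(defun decode-s16le (octets &optional (start 0))
  "Read a signed 16-bit little-endian integer from OCTETS at index START.
Assumption: little-endian byte order, as in the question's data."
  (let ((u (logior (aref octets start)
                   (ash (aref octets (1+ start)) 8))))
    ;; Bit 15 is the sign bit of a two's-complement 16-bit value.
    (if (logbitp 15 u)
        (- u #x10000)
        u)))

;; The 16 octets read from position 40 in the question.
(defparameter *seq*
  (coerce #(3 0 0 1 44 1 243 0 1 0 1 0 1 0 1 0)
          '(vector (unsigned-byte 8))))

;; Decode all eight shorts without going through a string at all.
(loop for i from 0 below (length *seq*) by 2
      collect (decode-s16le *seq* i))
;; => (3 256 300 243 1 1 1 1)
```

If you prefer to keep the cl-pack route, passing :encoding :iso-8859-1 to both babel:octets-to-string and babel:string-to-octets should give the same one-character-per-byte round trip. Decoding the integers directly, as in decode-s16le, avoids the string detour entirely.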