I would like to read UTF-8 characters from a gzip-file as easily as from a normal text file.
Unfortunately with-open-gzip-file does not seem to work as expected.
I’ve tried this:
CL-USER> (require :gzip-stream)
NIL
CL-USER> (with-open-file (in "test-utf8.txt") (read-line in))
"abéè"
NIL
CL-USER> (gzip-stream:with-open-gzip-file (in "test-utf8.txt.gz") (read-line in))
"abÃ©Ã¨"
NIL
I was expecting "abéè" instead of "abÃ©Ã¨".
Is gzip-stream broken, so that I should use another package, or is there some configuration that I’m missing?
TIA for any hints, Peter
Digging around in the source, it looks like gzip-stream's implementation of read-char (and thus read-line) reads a single byte and converts it to a character with code-char, so it fails badly with any multibyte character encoding such as UTF-8. (The stream classes inherit from fundamental-binary-input-stream/fundamental-binary-output-stream, which suggests they're not really intended for reading characters in the first place.)
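To make the failure mode concrete, here is a minimal sketch of what byte-at-a-time decoding does to this particular string. It does not use gzip-stream at all; the octet vector is simply the UTF-8 encoding of "abéè" written out by hand, the variable names are made up for illustration, and the result assumes an implementation whose character codes are Unicode code points (as in SBCL and CCL).

```lisp
;; The six octets of "abéè" encoded as UTF-8: é is #xC3 #xA9, è is #xC3 #xA8.
(defparameter *utf8-octets* #(97 98 #xC3 #xA9 #xC3 #xA8))

;; Treating each octet as a character code yields six characters, not four.
(defparameter *misdecoded* (map 'string #'code-char *utf8-octets*))

(print *misdecoded*)          ; "abÃ©Ã¨"
(print (length *misdecoded*)) ; 6
```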
One workaround is to read bytes from the decompressed stream instead of characters, and decode those into a string via other means. For example, in CCL:
CL-USER> (ql:quickload '(:alexandria :gzip-stream))
CL-USER> (gzip-stream:with-open-gzip-file (in "test-utf8.txt.gz")
           (decode-string-from-octets
            (alexandria:read-stream-content-into-byte-vector in)
            :external-format :utf-8))
"abéè"
6
SBCL has an sb-ext:octets-to-string function that works the same way.
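A rough SBCL equivalent of the CCL snippet might look like the sketch below. It assumes Quicklisp is available, as in the session above, and the helper name gunzip-file-to-string is made up for illustration. Like the CCL version, it reads the whole decompressed file into memory before decoding.

```lisp
;; SBCL only. Assumes Quicklisp is installed, as in the session above.
(ql:quickload '(:alexandria :gzip-stream) :silent t)

;; Hypothetical helper: decompress PATH and decode its bytes as UTF-8.
(defun gunzip-file-to-string (path)
  (gzip-stream:with-open-gzip-file (in path)
    (sb-ext:octets-to-string
     (alexandria:read-stream-content-into-byte-vector in)
     :external-format :utf-8)))
```

Calling (gunzip-file-to-string "test-utf8.txt.gz") should then return the decoded contents as a single string.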
When reading the entire decompressed file into memory as above isn't desirable (there's a reason it's compressed, right?), the flexi-streams package lets you wrap a byte-oriented stream, like the ones gzip-stream creates, in a character-oriented one that does the conversion on demand, so you can read a line at a time:
CL-USER> (gzip-stream:with-open-gzip-file (raw "test-utf8.txt.gz")
           (let ((utf8 (flexi-streams:make-flexi-stream raw :external-format :utf-8)))
             (read-line utf8)))
"abéè"
NIL
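To process every line rather than just the first, the same wrapping can be put in a loop. The following is only a sketch: the function names map-utf8-lines and map-gzip-lines are made up for illustration, and it assumes Quicklisp is available to load both libraries.

```lisp
;; Assumes Quicklisp is installed, as in the session above.
(ql:quickload '(:gzip-stream :flexi-streams) :silent t)

;; Hypothetical helper: wrap any binary input stream and walk its lines.
(defun map-utf8-lines (fn binary-stream)
  "Call FN on each line of BINARY-STREAM, decoding its bytes as UTF-8."
  (let ((in (flexi-streams:make-flexi-stream binary-stream
                                             :external-format :utf-8)))
    ;; With EOF-ERROR-P set to NIL, READ-LINE returns NIL at end of file.
    (loop for line = (read-line in nil nil)
          while line
          do (funcall fn line))))

;; Hypothetical helper: the same, for a gzip-compressed file.
(defun map-gzip-lines (fn path)
  "Call FN on each decoded line of the gzip-compressed file at PATH."
  (gzip-stream:with-open-gzip-file (raw path)
    (map-utf8-lines fn raw)))
```

For example, (map-gzip-lines (lambda (line) (format t "~a~%" line)) "test-utf8.txt.gz") would print the file line by line without holding the whole decompressed text in memory.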