utf-8, common-lisp, gzip

Reading UTF-8 with with-open-gzip-file


I would like to read UTF-8 characters from a gzip file as easily as from a normal text file.

Unfortunately, with-open-gzip-file does not seem to work as expected.

I’ve tried this:

CL-USER> (require :gzip-stream)
NIL
CL-USER> (with-open-file (in "test-utf8.txt") (read-line in))
"abéè"
NIL
CL-USER> (gzip-stream:with-open-gzip-file (in "test-utf8.txt.gz") (read-line in))
"abÃ©Ã¨"
NIL

I was expecting "abéè" instead of "abÃ©Ã¨".

Is gzip-stream broken, so that I should use another package, or is there some configuration that I’m missing?

TIA for any hints, Peter


Solution

  • Digging around in the source, it looks like gzip-stream's implementation of read-char (and thus read-line) reads a single byte and converts it to a character with code-char, so it fails badly with any multibyte character encoding such as UTF-8. The stream classes inherit from fundamental-binary-input-stream and fundamental-binary-output-stream, which suggests they were never really intended for reading characters in the first place.
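
    As a rough illustration of that failure mode (a sketch, not gzip-stream's actual code), mapping code-char over the UTF-8 octets of "abéè" one byte at a time garbles the text in the same way. This assumes an implementation where code-char follows Unicode code points, such as SBCL or CCL; the octet vector and variable names are written by hand for the example:

    ```lisp
    ;; UTF-8 encoding of "abéè": é is #xC3 #xA9, è is #xC3 #xA8.
    (defparameter *octets* #(97 98 #xC3 #xA9 #xC3 #xA8))

    ;; Treat every octet as its own character code, as a byte-wise
    ;; read-char would. Each two-byte sequence becomes two characters.
    (defparameter *garbled* (map 'string #'code-char *octets*))

    (print (length *garbled*))                ; 6 characters instead of 4
    (print (map 'list #'char-code *garbled*)) ; the raw octet values, one per character
    ```

    Characters 195 and 169 display as Ã and ©, which is why each accented letter shows up as a two-character sequence.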

    One workaround is to read bytes from the decompressed stream instead of characters, and decode those into a string via other means. For example, in CCL:

    CL-USER> (ql:quickload '(:alexandria :gzip-stream))
    CL-USER> (gzip-stream:with-open-gzip-file (in "test-utf8.txt.gz")
               (decode-string-from-octets
                 (alexandria:read-stream-content-into-byte-vector in)
                 :external-format :utf-8))
    "abéè"
    6
    

    SBCL has a sb-ext:octets-to-string function that works the same way.
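
    A minimal sketch of the SBCL variant, assuming Quicklisp is available and using the same alexandria helper as above. The function name read-gzip-utf8 is made up for illustration:

    ```lisp
    (ql:quickload '(:alexandria :gzip-stream))

    (defun read-gzip-utf8 (path)
      "Return the whole decompressed content of PATH decoded as UTF-8 (SBCL only)."
      (gzip-stream:with-open-gzip-file (in path)
        ;; Slurp all decompressed octets, then decode them in one step.
        (sb-ext:octets-to-string
         (alexandria:read-stream-content-into-byte-vector in)
         :external-format :utf-8)))
    ```

    It would be called as (read-gzip-utf8 "test-utf8.txt.gz") and returns the full file content as one string, line terminators included.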


    If reading the entire decompressed file into memory as above isn't desirable (there's a reason it's compressed, right?), the flexi-streams package lets you wrap a byte-oriented stream, like the ones created by gzip-stream, in a character-oriented one that handles the conversion on demand, so you can read a line at a time:

    CL-USER> (ql:quickload '(:flexi-streams :gzip-stream))
    CL-USER> (gzip-stream:with-open-gzip-file (raw "test-utf8.txt.gz")
               (let ((utf8 (flexi-streams:make-flexi-stream raw :external-format :utf-8)))
                 (read-line utf8)))
    "abéè"
    NIL
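
    To process every line rather than just the first, the same wrapping can be put in a loop. The following is only a sketch under the assumption that Quicklisp is available; map-gzip-lines is a hypothetical helper name, not part of gzip-stream or flexi-streams:

    ```lisp
    (ql:quickload '(:flexi-streams :gzip-stream))

    (defun map-gzip-lines (function path)
      "Call FUNCTION on each UTF-8 decoded line of the gzip file at PATH."
      (gzip-stream:with-open-gzip-file (raw path)
        (let ((in (flexi-streams:make-flexi-stream raw :external-format :utf-8)))
          ;; read-line returns NIL at end of file, which ends the loop.
          (loop for line = (read-line in nil nil)
                while line
                do (funcall function line)))))
    ```

    For example, (map-gzip-lines #'print "test-utf8.txt.gz") would print each decoded line in turn while keeping only the current line in memory.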