Translate encoding of string


I have a string that is in Windows-1252 encoding, but needs to be converted to UTF-8.

This is for a program that fixes a UTF-8 file in which some fields contain Russian text encoded as quoted-printable Windows-1252. Here's my original code for decoding the quoted-printable escapes:

(defn reencode
    [line]
    (str/replace line #"=([0-9A-Fa-f]{2})=([0-9A-Fa-f]{2})"
        (fn [match] (apply str
            (map #(char (Integer/parseInt % 16)) (drop 1 match))))))

Here's the final code:

(defn reencode
    [line]
    (str/replace line #"(=([0-9A-Fa-f]{2}))+"
        (fn [[match ignore]]
            (String.
                (byte-array (map
                    #(Integer/parseInt (apply str (drop 1 %)) 16)
                    (partition 3 match)))
                "Windows-1252"))))

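It fixes the encoding by applying (String. bytes "Windows-1252") to each consecutive run of quoted-printable escapes. The original function matched escapes only in pairs, so it skipped a lone escape such as =3D, which is the quoted-printable escape for =. It also converted each byte value straight to a char, which bypasses the Windows-1252 mapping for bytes 0x80 to 0x9F.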
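As a quick check of the run-based approach, here is a minimal sketch of a variant that can be pasted into a REPL. It is not the code above: the name `decode-qp-runs` is hypothetical, it pulls the hex pairs out with `re-seq` instead of `partition`, and it assumes the input contains no soft line breaks (a trailing `=` at end of line), which it does not handle.

```clojure
(require '[clojure.string :as str])

(defn decode-qp-runs
  "Hypothetical helper: decode every run of =XX escapes in `line`,
  interpreting the resulting bytes as Windows-1252."
  [line]
  ;; The regex has no capturing groups, so the replacement fn
  ;; receives the whole matched run as a plain string.
  (str/replace line #"(?:=[0-9A-Fa-f]{2})+"
               (fn [run]
                 (let [hex-pairs (re-seq #"[0-9A-Fa-f]{2}" run)
                       ;; unchecked-byte wraps values above 127 into
                       ;; the signed byte range instead of throwing.
                       bs (byte-array
                            (map #(unchecked-byte (Integer/parseInt % 16))
                                 hex-pairs))]
                   (String. bs "windows-1252")))))

;; A lone escape is decoded, not skipped:
(decode-qp-runs "a=3Db")   ; => "a=b"
;; 0x80 is the euro sign in Windows-1252:
(decode-qp-runs "=805")    ; => "€5"
```

Collecting the whole run into one byte array before decoding keeps the charset conversion in a single place, which matters more for multi-byte encodings than for Windows-1252 but costs nothing here.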


Solution

  • The best way to convert Windows-1252 bytes read from disk is to use the underlying Java String class directly.

    (def my-string (String. bytes-from-file "Windows-1252"))
    

    This returns a Java String decoded from the bytes using the Windows-1252 charset. From there you can get the bytes back out in UTF-8 encoding with

    (.getBytes my-string "UTF-8")
    

    Addressing your question more closely: if you have a file with mixed encodings, you could work out what delimits each encoded section and read each set of bytes in separately using the method above.

    Edit: The Windows-1252 string has been encoded with quoted-printable. You will first need to decode it, either with your function or, preferably, with the QuotedPrintableCodec class from Apache Commons Codec, passing it the Windows-1252 charset. Its decode method returns a Java String that you can operate on directly with no further transformation.

    N.B. For some measure of type safety, you should probably pass java.nio.charset.Charset objects rather than Strings when specifying the charset (the String class accepts either).
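The conversion described in this answer, using Charset objects instead of charset-name strings, might be sketched as follows. The helper name `cp1252-bytes->utf8-bytes` and the one-byte sample are assumptions for illustration; in practice the byte array would come from the file.

```clojure
(import '(java.nio.charset Charset StandardCharsets))

;; windows-1252 is not one of the StandardCharsets constants,
;; so it is looked up by name once and reused.
(def ^Charset windows-1252 (Charset/forName "windows-1252"))

(defn cp1252-bytes->utf8-bytes
  "Hypothetical helper: decode `bs` as Windows-1252, then
  re-encode the resulting String as UTF-8 bytes."
  [^bytes bs]
  (.getBytes (String. bs windows-1252) StandardCharsets/UTF_8))

;; 0xE9 is é in Windows-1252; in UTF-8 it becomes the bytes C3 A9.
(def sample (byte-array [(unchecked-byte 0xE9)]))

;; Mask with 0xFF to view the signed bytes as unsigned values.
(map #(bit-and % 0xFF) (cp1252-bytes->utf8-bytes sample))
;; => (195 169)
```

A misspelled charset name then fails once, at `Charset/forName`, instead of at every call that decodes or encodes.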