While answering haskell-convert-unicode-sequence-to-utf-8, I came upon some strange behaviour of ByteString.putStrLn:
{-# LANGUAGE OverloadedStrings #-}
module Main where
import Data.Text (Text)
import Data.ByteString (ByteString)
import qualified Data.ByteString.Char8 as B
inputB, inputB' :: ByteString
inputB = "ДЕЖЗИЙКЛМНОПРСТУФ"
inputB' = "test"
main :: IO ()
main = do putStr "B.putStrLn inputB: "; B.putStrLn inputB
          putStr "print inputB: "; print inputB
          putStr "B.putStrLn inputB': "; B.putStrLn inputB'
          putStr "print inputB': "; print inputB'
which yields
B.putStrLn inputB:
rint inputB: "\DC4\NAK\SYN\ETB\CAN\EM\SUB\ESC\FS\GS\RS\US !\"#$"
B.putStrLn inputB': test
print inputB': "test"
What I do not understand is why the output on the first line is missing, and why the p in print is missing on the second line.
My guess is that this has something to do with the Russian letters leading to malformed input, because the simple case of "test" just works.
xxd output
> stack exec -- unicode | xxd
00000000: 422e 7075 7453 7472 4c6e 2069 6e70 7574 B.putStrLn input
00000010: 423a 2014 1516 1718 191a 1b1c 1d1e 1f20 B: ............
00000020: 2122 2324 0a70 7269 6e74 2069 6e70 7574 !"#$.print input
00000030: 423a 2022 5c44 4334 5c4e 414b 5c53 594e B: "\DC4\NAK\SYN
00000040: 5c45 5442 5c43 414e 5c45 4d5c 5355 425c \ETB\CAN\EM\SUB\
00000050: 4553 435c 4653 5c47 535c 5253 5c55 5320 ESC\FS\GS\RS\US
00000060: 215c 2223 2422 0a42 2e70 7574 5374 724c !\"#$".B.putStrL
00000070: 6e20 696e 7075 7442 273a 2074 6573 740a n inputB': test.
00000080: 7072 696e 7420 696e 7075 7442 273a 2022 print inputB': "
00000090: 7465 7374 220a test".
libraries
> stack exec -- ghc-pkg list
/opt/ghc/7.10.3/lib/ghc-7.10.3/package.conf.d
Cabal-1.22.5.0
array-0.5.1.0
base-4.8.2.0
bin-package-db-0.0.0.0
binary-0.7.5.0
bytestring-0.10.6.0
containers-0.5.6.2
deepseq-1.4.1.1
directory-1.2.2.0
filepath-1.4.0.0
ghc-7.10.3
ghc-prim-0.4.0.0
haskeline-0.7.2.1
hoopl-3.10.0.2
hpc-0.6.0.2
integer-gmp-1.0.0.0
pretty-1.1.2.0
process-1.2.3.0
rts-1.0
template-haskell-2.10.0.0
terminfo-0.4.0.1
time-1.5.0.1
transformers-0.4.2.0
unix-2.7.1.0
xhtml-3000.2.1
/home/epsilonhalbe/.stack/snapshots/x86_64-linux/lts-5.5/7.10.3/pkgdb
text-1.2.2.0
/home/epsilonhalbe/programming/unicode/.stack-work/install/x86_64-linux/lts-5.5/7.10.3/pkgdb
and the locale
> locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC=de_AT.UTF-8
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY=de_AT.UTF-8
LC_MESSAGES="en_US.UTF-8"
LC_PAPER=de_AT.UTF-8
LC_NAME=de_AT.UTF-8
LC_ADDRESS=de_AT.UTF-8
LC_TELEPHONE=de_AT.UTF-8
LC_MEASUREMENT=de_AT.UTF-8
LC_IDENTIFICATION=de_AT.UTF-8
LC_ALL=
This is not a terminal problem; the problem happens earlier, in the conversion to ByteString. Remember that, because you used OverloadedStrings,
inputB = "ДЕЖЗИЙКЛМНОПРСТУФ"
is really shorthand for
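inputB = fromString "ДЕЖЗИЙКЛМНОПРСТУФ" :: ByteString
which does not encode the string as UTF-8. The IsString instance for ByteString keeps only the low 8 bits of each Char, so Д (U+0414) becomes the single byte 0x14, Е (U+0415) becomes 0x15, and so on. That is exactly the \DC4\NAK\SYN... sequence that print shows.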
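A small sketch to see the truncation directly. It uses Data.ByteString.Char8.pack, which as far as I know behaves the same way as the IsString instance that OverloadedStrings picks for ByteString: each Char is cut down to its low 8 bits. The name truncated is only for illustration.

```haskell
module Main where

import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as B

-- Char8.pack keeps only the low 8 bits of every Char,
-- so U+0414 .. U+0424 end up as the bytes 0x14 .. 0x24.
truncated :: BS.ByteString
truncated = B.pack "ДЕЖЗИЙКЛМНОПРСТУФ"

main :: IO ()
main = do
  -- The raw bytes: 17 values running from 20 (0x14) to 36 (0x24).
  print (BS.unpack truncated)
  -- Compare against the byte range seen in the xxd dump; prints True.
  print (truncated == BS.pack [0x14 .. 0x24])
```

None of the original Cyrillic information survives this conversion, which is why nothing done later at the terminal can bring it back.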
If, instead, you want the ByteString to contain UTF-8 encoded characters, use Data.ByteString.UTF8 (from the utf8-string package):
import qualified Data.ByteString.UTF8 as BU
inputB = BU.fromString "ДЕЖЗИЙКЛМНОПРСТУФ"
Then this will work:
B.putStrLn inputB
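An alternative sketch that avoids the extra dependency, assuming the text package is available (it shows up in the package list in the question): encodeUtf8 from Data.Text.Encoding produces the UTF-8 bytes of a Text value. The name utf8Bytes is only for illustration.

```haskell
module Main where

import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Go through Text and encode explicitly as UTF-8.
utf8Bytes :: BS.ByteString
utf8Bytes = TE.encodeUtf8 (T.pack "ДЕЖЗИЙКЛМНОПРСТУФ")

main :: IO ()
main = do
  -- Each of the 17 Cyrillic letters takes two bytes in UTF-8.
  print (BS.length utf8Bytes)
  -- On a UTF-8 terminal this should display the original letters.
  B.putStrLn utf8Bytes
```

Note that no OverloadedStrings is needed here; the literal stays an ordinary String until T.pack converts it.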
Why is the "p" on line two missing?
The behavior is expected once you look at the bytes that are actually written. As the xxd output shows, they are 0x14 through 0x24 followed by a newline. All of them are below 0x80, so this is technically still valid UTF-8, but 0x14 to 0x1f are ASCII control characters, and one of them, 0x1b, is ESC.
ESC starts a terminal escape sequence. As far as I can tell, in the usual (ECMA-48 style) parsing the bytes 0x20 to 0x24 (space, !, ", #, $) then count as intermediate bytes of that sequence, and the sequence only ends at the next byte in the range 0x30 to 0x7e. The newline in between is still executed, but the first byte in that range is the "p" of print, so it is consumed as the final byte of the escape sequence. Your terminal seems to silently drop a sequence it does not recognize (mine prints garbage), so both the rest of the first line and the "p" were lost.
You will note that the "p" is present in the xxd output. The terminal just treats it as part of the escape sequence and does not print it.
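To see which bytes a terminal could plausibly swallow here, the following is a rough sketch of a heavily simplified escape-sequence model. It is my own simplification, not the parser of any real terminal: it assumes that ESC (0x1b) opens a sequence, that bytes below 0x30 inside the sequence are either executed as controls or collected as intermediates, and that the first byte from 0x30 upwards ends the sequence and is consumed. The name visible is hypothetical.

```haskell
module Main where

import Data.Word (Word8)
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as B

-- Simplified model: returns the bytes that would end up on screen
-- (printable bytes and line feeds). Real terminals handle far more cases.
visible :: [Word8] -> [Word8]
visible = ground
  where
    -- Normal state: show printable bytes and line feeds, drop other controls.
    ground [] = []
    ground (0x1b : rest) = escape rest
    ground (b : rest)
      | b == 0x0a || b >= 0x20 = b : ground rest
      | otherwise              = ground rest

    -- Inside an escape sequence (assumed rules, see the text above).
    escape [] = []
    escape (b : rest)
      | b == 0x1b              = escape rest      -- another ESC restarts the sequence
      | b == 0x18 || b == 0x1a = ground rest      -- CAN and SUB cancel the sequence
      | b == 0x0a              = b : escape rest  -- line feed is still executed
      | b < 0x30               = escape rest      -- other controls and intermediates
      | otherwise              = ground rest      -- final byte: consumed, sequence ends

main :: IO ()
main = do
  -- The truncated bytes from the question, followed by the start of the next line.
  let written = BS.append (BS.pack [0x14 .. 0x24]) (B.pack "\nprint inputB")
  -- Under this model the "p" is eaten as the final byte of the sequence.
  print (BS.pack (visible (BS.unpack written)))
```

Under these assumptions the model keeps the line feed and "rint inputB" but drops everything from the first line and the "p", which matches what the question shows.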