html haskell html-parsing content-type non-ascii-characters

Why can Haskell not handle characters from a specific website?


I was wondering whether I could write a Haskell program to check for updates of some novels on demand; the website I am using as an example is http://www.piaotian.net/html/7/7430/. I ran into a problem when displaying its contents (on a Mac running OS X El Capitan). The simple code follows:

import Network.HTTP

openURL :: String -> IO String
openURL = (>>= getResponseBody) . simpleHTTP . getRequest

display :: String -> IO ()
display = (>>= putStrLn) . openURL

Then, when I run display "http://www.piaotian.net/html/7/7430/" in GHCi, some strange characters appear; the first lines look like this:

<title>×ß½øÐÞÏÉ×îÐÂÕ½Ú,×ß½øÐÞÏÉÎÞµ¯´°È«ÎÄÔĶÁ_Æ®ÌìÎÄѧ</title>
<meta http-equiv="Content-Type" content="text/html; charset=gbk" />
<meta name="keywords" content="×ß½øÐÞÏÉ,×ß½øÐÞÏÉ×îÐÂÕ½Ú,×ß½øÐÞÏÉÎÞµ¯´° Æ®ÌìÎÄѧ" />
<meta name="description" content="Æ®ÌìÎÄѧÍøÌṩ×ß½øÐÞÏÉ×îÐÂÕ½ÚÃâ·ÑÔĶÁ£¬Ç뽫×ß½øÐÞÏÉÕ½ÚĿ¼¼ÓÈëÊղط½±ãÏ´ÎÔĶÁ,Æ®ÌìÎÄѧС˵ÔĶÁÍø¾¡Á¦ÔÚµÚһʱ¼ä¸üÐÂС˵×ß½øÐÞÏÉ£¬Èç·¢ÏÖδ¼°Ê±¸üУ¬ÇëÁªÏµÎÒÃÇ¡£" />
<meta name="copyright" content="×ß½øÐÞÏÉ°æȨÊôÓÚ×÷ÕßÎáµÀ³¤²»¹Â" />
<meta name="author" content="ÎáµÀ³¤²»¹Â" />
<link rel="stylesheet" href="/scripts/read/list.css" type="text/css" media="all" />
<script type="text/javascript">

I also tried to download as a file as follows:

import Network.HTTP

openURL :: String -> IO String
openURL = (>>= getResponseBody) . simpleHTTP . getRequest

downloading :: String -> IO ()
downloading = (>>= writeFile fileName) . openURL

But after downloading, the file looks like the one in the photo.

If I download the page with Python (using urllib, for example), the characters are displayed normally. Also, if I write a Chinese HTML file myself and parse it, there seems to be no problem. So the problem seems to lie with the website. However, I don't see any difference between the characters on the site and those I write.

Any help explaining the reason behind this is much appreciated.

P.S.
The python code is as follows:

import urllib

theFic = file_path  # placeholder: destination path, defined before it is used

urllib.urlretrieve('http://www.piaotian.net/html/7/7430/', theFic)

And the file is all fine and good.


Solution

  • Since you said you are interested in just the links, there is no need to convert the GBK encoding to Unicode.

    Here is a version which prints out all links like "123456.html" in the document:

    #!/usr/bin/env stack
    {- stack
      --resolver lts-6.0 --install-ghc runghc
      --package wreq --package lens
      --package tagsoup
    -}
    
    {-# LANGUAGE OverloadedStrings #-}
    
    import Network.Wreq
    import qualified Data.ByteString.Lazy.Char8 as LBS
    import Control.Lens
    import Text.HTML.TagSoup
    import Data.Char
    import Control.Monad
    
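    -- match \d+\.html: one or more digits followed by ".html"
    isNumberHtml :: LBS.ByteString -> Bool
    isNumberHtml lbs = not (LBS.null digits) && rest == ".html"
      where (digits, rest) = LBS.span isDigit lbs
    
    -- keep only <a> tags whose href looks like "123456.html"
    wanted :: Tag LBS.ByteString -> Bool
    wanted t = isTagOpenName "a" t && isNumberHtml (fromAttrib "href" t)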
    
    main = do
      r <- get "http://www.piaotian.net/html/7/7430/"
      let body = r ^. responseBody :: LBS.ByteString
          tags = parseTags body
          links = filter wanted tags
          hrefs = map (fromAttrib "href") links
      forM_ hrefs LBS.putStrLn
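
As for the reason behind the garbled output: the page is served as GBK (see the charset in its meta tag), and a String-based response body most likely ends up with one Char per byte, so each GBK byte is shown as the Latin-1 character with the same code point. The following is a minimal sketch of that effect, assuming the first four bytes of the title are D7 DF BD F8 (the GBK encoding of 走进, which matches the ×ß½ø seen in the question). The names gbkBytes and asLatin1 are made up for illustration, and the sketch needs only the bytestring package.

```haskell
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as BC

-- Assumed GBK bytes for the first two characters of the page title.
gbkBytes :: BS.ByteString
gbkBytes = BS.pack [0xD7, 0xDF, 0xBD, 0xF8]

-- Read each byte as one Char, which is roughly what a String-based
-- response body amounts to. No decoding takes place here.
asLatin1 :: BS.ByteString -> String
asLatin1 = BC.unpack

main :: IO ()
main = do
  -- Four Chars, one per byte, instead of two Chinese characters.
  print (map fromEnum (asLatin1 gbkBytes))
  -- Packing the Chars back gives the original bytes, so the data
  -- itself is intact; only its interpretation as text is wrong.
  print (BS.unpack (BC.pack (asLatin1 gbkBytes)) == BS.unpack gbkBytes)
```

Because the bytes themselves are not damaged, fetching the body as a ByteString (as the wreq version above does) and writing it out unchanged should give the same file as the Python download.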