
What is the best way to get data from a URL and parse it in Haskell?


I'm having trouble parsing data fetched from a URL.

The URL starts with "https://", so I think I should use import Network.HTTP.Conduit. But

simpleHttp url

returns an L.ByteString, and I don't understand what I should do after that.

So far I have this code to get the data:

toStrict1 :: L.ByteString -> B.ByteString
toStrict1 = B.concat . L.toChunks

main :: IO ()
main = do
    lbs <- simpleHttp url
    let page = toStrict1 lbs

and this example of parsing:

    let lastModifiedDateTime = fromFooter $ parseTags doc
    putStrLn $ "wiki.haskell.org was last modified on " ++ lastModifiedDateTime
    where fromFooter = unwords . drop 6 . words . innerText . take 2 . dropWhile (~/= "<li id=footer-info-lastmod>")

How can I combine these two parts of the code?


Solution

  • As you've seen, the simpleHttp function returns a lazy bytestring. There are several ways to deal with this in TagSoup.

    First, it turns out that you can parse it directly. The function parseTags has signature:

    parseTags :: StringLike str => str -> [Tag str]
    

    meaning that it can parse any type str with a StringLike instance, and if you look at the Text.StringLike module documentation, you'll see that lazy ByteStrings have a StringLike instance.
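
    As a minimal sketch of that polymorphism (the sample markup and the names htmlString, htmlBytes, and htmlText are made up for illustration), the same parseTags call can be applied to a String, a lazy ByteString, and a lazy Text without any network access:

    ```haskell
    import Text.HTML.TagSoup (innerText, parseTags)
    import qualified Data.ByteString.Lazy.Char8 as CL
    import qualified Data.Text.Lazy as TL

    -- The same (made-up) markup as three different string types.
    htmlString :: String
    htmlString = "<p>Hello, <b>world</b></p>"

    htmlBytes :: CL.ByteString
    htmlBytes = CL.pack htmlString

    htmlText :: TL.Text
    htmlText = TL.pack htmlString

    main :: IO ()
    main = do
        -- parseTags produces [Tag str] for whichever str it is given,
        -- and innerText returns that same str type.
        putStrLn (innerText (parseTags htmlString))
        CL.putStrLn (innerText (parseTags htmlBytes))
        putStrLn (TL.unpack (innerText (parseTags htmlText)))
    ```

    In each case the element type of the tag list follows the input type, which is why the functions applied afterwards have to match it.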

    However, if you go this route, you need to be aware that everything's kind of "trapped" in a ByteString world, so you have to write your code using versions of functions like words and unwords that are bytestring-compatible, and even your putStrLn needs an adapter. A full working example would look like this:

    import Network.HTTP.Conduit
    import Text.HTML.TagSoup
    -- Char8 versions of words, unwords, and unpack for lazy bytestrings
    import qualified Data.ByteString.Lazy.Char8 as CL
    
    main :: IO ()
    main = do
        lbs <- simpleHttp "https://wiki.haskell.org"
        let lastModifiedDateTime = fromFooter $ parseTags lbs
        putStrLn $ "wiki.haskell.org was last modified on " 
            ++ CL.unpack lastModifiedDateTime
        where fromFooter = CL.unwords . drop 6 . CL.words
                  . innerText . take 2 . dropWhile (~/= "<li id=footer-info-lastmod>")
    

    and it works fine:

    > main
    wiki.haskell.org was last modified on 9 September 2013, at 22:38.
    >
    

    The functions from Data.ByteString.Lazy.Char8 treat every byte as one character, so they only give correct results when the bytestring is ASCII (or Latin-1) encoded, which is close enough for this example to work.
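
    To illustrate where that assumption breaks, here is a small sketch (the names eAcute, viaChar8, and viaUtf8 are hypothetical) that takes the two UTF-8 bytes for "é" and compares the Char8 view with a real UTF-8 decode from the text package:

    ```haskell
    import qualified Data.ByteString.Lazy as BL
    import qualified Data.ByteString.Lazy.Char8 as CL
    import qualified Data.Text.Lazy as TL
    import qualified Data.Text.Lazy.Encoding as TLE

    -- "é" encoded as UTF-8 is the two bytes 0xC3 0xA9.
    eAcute :: BL.ByteString
    eAcute = BL.pack [0xC3, 0xA9]

    -- Char8 maps each byte to one Char, so this yields two characters.
    viaChar8 :: String
    viaChar8 = CL.unpack eAcute

    -- Decoding as UTF-8 yields the single intended character.
    viaUtf8 :: String
    viaUtf8 = TL.unpack (TLE.decodeUtf8 eAcute)

    main :: IO ()
    main = do
        print (length viaChar8)  -- 2
        print (length viaUtf8)   -- 1
    ```

    The footer text in the example above contains only ASCII characters, so both views agree there.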

    However, it would be more robust to decode the bytestring based on the proper character encoding to a valid text type. The two main text types in Haskell are the default String type, which is inefficient and slow, but easy to work with, and the Text type, which is highly efficient but a bit more complicated. (Like ByteString, you need to use Text-compatible versions of functions like words and so on.) Both String and Text have StringLike instances, so they both work fine with TagSoup.

    If we were going to write production-quality code, we'd actually consult the response headers from the HTTP request and/or check for a <meta> tag in the HTML to determine the real encoding. But if we just assume the encoding is UTF-8 (which it is), the Text version looks like this:

    import Network.HTTP.Conduit
    import Text.HTML.TagSoup
    import qualified Data.Text.Lazy as TL
    -- provides decodeUtf8 for lazy bytestrings
    import qualified Data.Text.Lazy.Encoding as TL
    
    main :: IO ()
    main = do
        lbs <- simpleHttp "https://wiki.haskell.org"
        let lastModifiedDateTime = fromFooter $ parseTags (TL.decodeUtf8 lbs)
        putStrLn $ "wiki.haskell.org was last modified on " 
            ++ TL.unpack lastModifiedDateTime
        where fromFooter = TL.unwords . drop 6 . TL.words
                  . innerText . take 2 . dropWhile (~/= "<li id=footer-info-lastmod>")
    

    and a String version using Data.ByteString.Lazy.UTF8 from the utf8-string package looks like this:

    import Network.HTTP.Conduit
    import Text.HTML.TagSoup
    -- provides toString, which decodes a lazy bytestring as UTF-8
    import qualified Data.ByteString.Lazy.UTF8 as BL
    
    main :: IO ()
    main = do
        lbs <- simpleHttp "https://wiki.haskell.org"
        let lastModifiedDateTime = fromFooter $ parseTags (BL.toString lbs)
        putStrLn $ "wiki.haskell.org was last modified on " 
            ++ lastModifiedDateTime
        where fromFooter = unwords . drop 6 . words
                  . innerText . take 2 . dropWhile (~/= "<li id=footer-info-lastmod>")
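
    The point about consulting the response headers can be sketched roughly as follows. This is only an illustration under stated assumptions, not a complete solution: charsetOf, splitOn, and decodeBody are hypothetical helpers, the Content-Type value is passed in as a plain String, only UTF-8 and ISO-8859-1 are recognised, and anything else falls back to a lenient UTF-8 decode. Getting the header itself is left out, because simpleHttp returns only the body.

    ```haskell
    import Data.Char (isSpace, toLower)
    import Data.List (dropWhileEnd, isPrefixOf)
    import qualified Data.ByteString.Lazy as BL
    import qualified Data.Text.Lazy as TL
    import qualified Data.Text.Lazy.Encoding as TLE
    import Data.Text.Encoding.Error (lenientDecode)

    -- Hypothetical helper: pull the charset parameter out of a Content-Type
    -- header value such as "text/html; charset=UTF-8" (result is lower-cased).
    charsetOf :: String -> Maybe String
    charsetOf header =
        case [drop (length key) p | p <- params, key `isPrefixOf` p] of
            (v : _) | not (null v) -> Just (filter (/= '"') v)
            _                      -> Nothing
      where
        key    = "charset="
        params = map (map toLower . trim) (splitOn ';' header)
        trim   = dropWhileEnd isSpace . dropWhile isSpace

    -- Hypothetical helper: split a string on a separator character.
    splitOn :: Char -> String -> [String]
    splitOn c s = case break (== c) s of
        (chunk, [])       -> [chunk]
        (chunk, _ : rest) -> chunk : splitOn c rest

    -- Hypothetical helper: choose a decoder from the declared charset.
    -- Unknown or missing charsets are treated as UTF-8, with invalid bytes
    -- replaced by U+FFFD instead of raising an exception.
    decodeBody :: Maybe String -> BL.ByteString -> TL.Text
    decodeBody (Just "iso-8859-1") = TLE.decodeLatin1
    decodeBody _                   = TLE.decodeUtf8With lenientDecode

    main :: IO ()
    main = do
        let body = BL.pack [0x63, 0x61, 0x66, 0xC3, 0xA9]  -- "café" in UTF-8
        putStrLn (TL.unpack (decodeBody (charsetOf "text/html; charset=UTF-8") body))
    ```

    The decoded lazy Text can then be passed to parseTags exactly as in the Text version above.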