I'm having trouble parsing data from a URL.
My URL starts with "https://", so I think I should use import Network.HTTP.Conduit. But simpleHttp url returns an L.ByteString, and I really don't understand what I should do after that.
So I have this code to get the data:
toStrict1 :: L.ByteString -> B.ByteString
toStrict1 = B.concat . L.toChunks
main :: IO ()
main = do
    lbs <- simpleHttp url
    let page = toStrict1 lbs
and an example of parsing:
    let lastModifiedDateTime = fromFooter $ parseTags doc
    putStrLn $ "wiki.haskell.org was last modified on " ++ lastModifiedDateTime
  where fromFooter = unwords . drop 6 . words . innerText . take 2 . dropWhile (~/= "<li id=footer-info-lastmod>")
How can I combine these two parts of the code?
As you've seen, the simpleHttp function returns a lazy ByteString. There are several ways to deal with this in TagSoup.
First, it turns out that you can parse it directly. The function parseTags has the signature:
parseTags :: StringLike str => str -> [Tag str]
meaning that it can parse any type str with a StringLike instance, and if you look at the Text.StringLike module documentation, you'll see that lazy ByteStrings have a StringLike instance.
However, if you go this route, be aware that everything stays "trapped" in the ByteString world: you have to write your code using ByteString-compatible versions of functions like words and unwords, and even your putStrLn needs an adapter. A full working example would look like this:
import Network.HTTP.Conduit
import Text.HTML.TagSoup
import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString.Lazy.Char8 as CL
main :: IO ()
main = do
    lbs <- simpleHttp "https://wiki.haskell.org"
    let lastModifiedDateTime = fromFooter $ parseTags lbs
    putStrLn $ "wiki.haskell.org was last modified on "
        ++ CL.unpack lastModifiedDateTime
  where fromFooter = CL.unwords . drop 6 . CL.words
            . innerText . take 2 . dropWhile (~/= "<li id=footer-info-lastmod>")
and it works fine:
> main
wiki.haskell.org was last modified on 9 September 2013, at 22:38.
>
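To make the fromFooter pipeline above a little more concrete, here is a minimal sketch of just the word-dropping step, run on a hard-coded lazy ByteString instead of a live page. The names sampleFooter and dateFromFooter are hypothetical, and the footer wording is an assumption reconstructed from the output shown above; the real page may phrase it differently.

```haskell
import qualified Data.ByteString.Lazy.Char8 as CL

-- Hypothetical footer text, reconstructed from the output shown above;
-- the live page may word it differently.
sampleFooter :: CL.ByteString
sampleFooter =
  CL.pack "This page was last modified on 9 September 2013, at 22:38."

-- Drop the first six words ("This page was last modified on") and glue the
-- rest back together, staying in lazy ByteStrings the whole way.
dateFromFooter :: CL.ByteString -> CL.ByteString
dateFromFooter = CL.unwords . drop 6 . CL.words
```

In GHCi, CL.putStrLn (dateFromFooter sampleFooter) prints 9 September 2013, at 22:38. Note that every step uses the CL-qualified function, which is the "trapped in ByteString" cost mentioned earlier.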
The functions from Data.ByteString.Lazy.Char8 treat each byte as a single character, so they are only really safe for ASCII (strictly, Latin-1) data, which is close enough for this example to work.
However, it would be more robust to decode the ByteString, using the proper character encoding, into a valid text type. The two main text types in Haskell are the default String type, which is inefficient and slow but easy to work with, and the Text type, which is highly efficient but a bit more complicated. (As with ByteString, you need to use Text-compatible versions of functions like words, and so on.) Both String and Text have StringLike instances, so they both work fine with TagSoup.
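As a small illustration of why decoding matters, the sketch below compares the Char8 view of some UTF-8 bytes with a proper UTF-8 decode. It uses only the bytestring and text packages; the names cafeBytes, viaChar8 and viaUtf8 are made up for this example.

```haskell
import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString.Lazy.Char8 as CL
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TLE

-- The UTF-8 encoding of "café": the 'é' takes two bytes (0xC3 0xA9).
cafeBytes :: BL.ByteString
cafeBytes = BL.pack [0x63, 0x61, 0x66, 0xC3, 0xA9]

-- Char8 maps every byte to one Char, so the two-byte 'é' turns into
-- two separate characters and the string has length 5.
viaChar8 :: String
viaChar8 = CL.unpack cafeBytes

-- Decoding as UTF-8 recovers the four characters that were intended.
viaUtf8 :: String
viaUtf8 = TL.unpack (TLE.decodeUtf8 cafeBytes)
```

Here length viaChar8 is 5 while length viaUtf8 is 4. For a page that is pure ASCII the two agree, which is why the ByteString example gets away with it.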
If we were going to write production-quality code, we'd actually consult the response headers from the HTTP request and/or check for a <meta> tag in the HTML to determine the real encoding. But, if we just assume the encoding is UTF-8 (which it is), the Text version looks like this:
import Network.HTTP.Conduit
import Text.HTML.TagSoup
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TL
import qualified Data.ByteString.Lazy as BL
main :: IO ()
main = do
    lbs <- simpleHttp "https://wiki.haskell.org"
    let lastModifiedDateTime = fromFooter $ parseTags (TL.decodeUtf8 lbs)
    putStrLn $ "wiki.haskell.org was last modified on "
        ++ TL.unpack lastModifiedDateTime
  where fromFooter = TL.unwords . drop 6 . TL.words
            . innerText . take 2 . dropWhile (~/= "<li id=footer-info-lastmod>")
and a String version using Data.ByteString.Lazy.UTF8 from the utf8-string package looks like this:
import Network.HTTP.Conduit
import Text.HTML.TagSoup
import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString.Lazy.UTF8 as BL
main :: IO ()
main = do
    lbs <- simpleHttp "https://wiki.haskell.org"
    let lastModifiedDateTime = fromFooter $ parseTags (BL.toString lbs)
    putStrLn $ "wiki.haskell.org was last modified on "
        ++ lastModifiedDateTime
  where fromFooter = unwords . drop 6 . words
            . innerText . take 2 . dropWhile (~/= "<li id=footer-info-lastmod>")
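Coming back to the point about determining the real encoding: one possible shape for that step is sketched below. It is only a sketch under assumptions. decodeBody is a hypothetical helper, it expects the caller to have already pulled a charset label out of the Content-Type header or a <meta> tag (that extraction is not shown), it recognises only two labels, and falling back to lenient UTF-8 is a choice made for this example rather than anything TagSoup or http-conduit prescribes.

```haskell
import Data.Char (toLower)
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TLE
import Data.Text.Encoding.Error (lenientDecode)

-- Hypothetical helper: decode a response body given a charset label that the
-- caller has already extracted from the response headers or a <meta> tag.
decodeBody :: Maybe String -> BL.ByteString -> TL.Text
decodeBody charset body =
  case fmap (map toLower) charset of
    Just "iso-8859-1" -> TLE.decodeLatin1 body
    Just "latin1"     -> TLE.decodeLatin1 body
    -- Assumption: UTF-8, unknown labels, and a missing label all fall back
    -- to UTF-8. lenientDecode substitutes U+FFFD for invalid input instead
    -- of throwing an exception.
    _                 -> TLE.decodeUtf8With lenientDecode body
```

The resulting lazy Text can be handed straight to parseTags, exactly as TL.decodeUtf8 lbs is in the Text version above.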