I'm constantly admitting that I'm bad at lens, but isn't learning by examples is a good thing? I want to take HTML, parse it with taggy-lens
and then remove all the script
elements from inside. Here's my attempt:
#!/usr/bin/env stack
-- stack --resolver lts-7.1 --install-ghc runghc --package text --package lens --package taggy-lens --package string-class --package classy-prelude
{-# LANGUAGE NoImplicitPrelude #-}
{-# LANGUAGE OverloadedStrings #-}
import ClassyPrelude
import Control.Lens hiding (children, element)
import Data.String.Class (toText, fromText, toString)
import Data.Text (Text)
import Text.Taggy.Lens
import qualified Text.Taggy.Lens as Taggy
import qualified Text.Taggy.Renderer as Renderer
somehtmlSmall :: Text
somehtmlSmall =
"<!doctype html><html><body>\
\<div id=\"article\"><div>first</div><div>second</div><script>this should be removed</script><div>third</div></div>\
\</body></html>"
renderWithoutScriptTag :: Text
renderWithoutScriptTag =
let mArticle :: Maybe Taggy.Element
mArticle =
(fromText somehtmlSmall) ^? html .
allAttributed (ix "id" . only "article")
mArticleFiltered =
fmap
(\el ->
el ^.. to universe . traverse .
filtered (\n -> n ^. name /= "script"))
mArticle
in maybe "" (toText . concatMap Renderer.render) mArticleFiltered
main :: IO ()
main = print renderWithoutScriptTag
Mark this file as executable and just run it, and you'll see:
➜ tmp ./scraping-question.hs
"<div id=\"article\"><div>first</div><div>second</div><script>this should be removed</script><div>third</div></div><div>first</div><div>second</div><div>third</div>"
So, this didn't work. I would like to:
Would be especially thankful, if you'd help me realize what's wrong with mine. Thanks!
The root of your problem is universe
, which flattens the DOM tree into a list. If you look again at the output, you will see the filtering is working fine but the tree structure is lost -- and so you get the unmodified article element (with all children still within) followed by the children nodes minus the script element.
One Control.Lens.Plated
combinator that can do what you want is transform
, which transforms "every element in the tree, in a bottom-up manner":
transform :: Plated a => (a -> a) -> a -> a
In particular, you can use it to filter the children nodes recursively:
renderWithoutScriptTag :: Text
renderWithoutScriptTag =
let mArticle :: Maybe Taggy.Element
mArticle =
(fromText somehtmlSmall) ^? html .
allAttributed (ix "id" . only "article")
mArticleFiltered =
fmap
(transform (children %~ filter (\n ->
n ^? element . name /= Just "script")))
mArticle
in maybe "" (toText . Renderer.render) mArticleFiltered