Search code examples
haskellhaskell-lens

Filter inner element from a tree via lens


I'm constantly admitting that I'm bad at lens, but isn't learning by examples is a good thing? I want to take HTML, parse it with taggy-lens and then remove all the script elements from inside. Here's my attempt:

#!/usr/bin/env stack
-- stack --resolver lts-7.1 --install-ghc runghc --package text --package lens --package taggy-lens --package string-class --package classy-prelude

{-# LANGUAGE NoImplicitPrelude #-}
{-# LANGUAGE OverloadedStrings #-}

import ClassyPrelude
import Control.Lens hiding (children, element)
import Data.String.Class (toText, fromText, toString)
import Data.Text (Text)
import Text.Taggy.Lens
import qualified Text.Taggy.Lens as Taggy
import qualified Text.Taggy.Renderer as Renderer

somehtmlSmall :: Text
somehtmlSmall =
    "<!doctype html><html><body>\
    \<div id=\"article\"><div>first</div><div>second</div><script>this should be removed</script><div>third</div></div>\
    \</body></html>"

renderWithoutScriptTag :: Text
renderWithoutScriptTag =
    let mArticle :: Maybe Taggy.Element
        mArticle =
            (fromText somehtmlSmall) ^? html .
            allAttributed (ix "id" . only "article")
        mArticleFiltered =
            fmap
                (\el ->
                      el ^.. to universe . traverse .
                      filtered (\n -> n ^. name /= "script"))
                mArticle
    in maybe "" (toText . concatMap Renderer.render) mArticleFiltered

main :: IO ()
main = print renderWithoutScriptTag

Mark this file as executable and just run it, and you'll see:

➜  tmp  ./scraping-question.hs
"<div id=\"article\"><div>first</div><div>second</div><script>this should be removed</script><div>third</div></div><div>first</div><div>second</div><div>third</div>"

So, this didn't work. I would like to:

  • have a working solution
  • understand the working solution

Would be especially thankful, if you'd help me realize what's wrong with mine. Thanks!


Solution

  • The root of your problem is universe, which flattens the DOM tree into a list. If you look again at the output, you will see the filtering is working fine but the tree structure is lost -- and so you get the unmodified article element (with all children still within) followed by the children nodes minus the script element.

    One Control.Lens.Plated combinator that can do what you want is transform, which transforms "every element in the tree, in a bottom-up manner":

    transform :: Plated a => (a -> a) -> a -> a
    

    In particular, you can use it to filter the children nodes recursively:

    renderWithoutScriptTag :: Text
    renderWithoutScriptTag =
        let mArticle :: Maybe Taggy.Element
            mArticle =
                (fromText somehtmlSmall) ^? html .
                allAttributed (ix "id" . only "article")
            mArticleFiltered =
                fmap
                    (transform (children %~ filter (\n ->
                        n ^? element . name /= Just "script")))
                    mArticle
        in maybe "" (toText . Renderer.render) mArticleFiltered