
Rescraping data with Enlive


I tried to create a function to scrape <h3> and <table> tags from an HTML page whose URL I pass to the function, and this part works as it should: I get a sequence of <h3> and <table> elements. But when I try to use the select function to extract only the table or h3 tags from that resulting sequence, I get (), or, if I try to map over those tags, (nil nil nil ...).

Could you please help me resolve this issue, or explain what I am doing wrong?

Here is the code:

(ns Test2
  (:require [net.cgrand.enlive-html :as html]
            [clojure.string :as string]))

(defn get-page
  "Gets the HTML page from the given URL."
  [url]
  (html/html-resource (java.net.URL. url)))

(defn h3+table
  "Returns a sequence of <h3> and <table> tags."
  [url]
  (html/select (get-page url)
               {[:div#wrap :div#middle :div#content :div#prospekt :div#prospekt_container :h3]
                [:div#wrap :div#middle :div#content :div#prospekt :div#prospekt_container :table]}))

(def url "http://www.belex.rs/trgovanje/prospekt/VZAS/show")

This line gives me a headache:

(html/select (h3+table url) [:table])

Could you please tell me what I am doing wrong?

Just to clarify my question: is it possible to use Enlive's select function to extract only the table tags from the result of (h3+table url)?


Solution

  • As @Julien pointed out, you will probably have to work with the deeply nested tree structure that you get from applying (html/select raw-html selectors) to the raw HTML. It looks like you are trying to apply html/select multiple times, but that doesn't work: html/select parses the HTML into a Clojure data structure, so you can't meaningfully apply it to that data structure again.
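
    For what it's worth, here is a minimal sketch of that point (using
    get-page and the URL from the question): parse the page once, then do
    all the narrowing with selectors against the parsed tree, instead of
    feeding one select's output into another select.

    ;; Parse the page once into Enlive's node tree.
    (def page (get-page "http://www.belex.rs/trgovanje/prospekt/VZAS/show"))

    ;; A set of selectors grabs both kinds of tags in one pass, in document order.
    (html/select page #{[:div#prospekt_container :h3]
                        [:div#prospekt_container :table]})

    ;; A narrower selector against the same parsed tree yields only the tables.
    (html/select page [:div#prospekt_container :table])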

    I found that parsing the website was actually a little involved, but it seemed like a nice use case for multimethods, so I hacked something together. Maybe this will get you started:

    (The code is ugly here; you can also check out this gist.)

    (ns tutorial.scrape1
      (:require [net.cgrand.enlive-html :as html]))
    
    ;; ^:dynamic avoids the "not dynamic" warning for earmuffed vars in Clojure 1.3+
    (def ^:dynamic *url* "http://www.belex.rs/trgovanje/prospekt/VZAS/show")
    
    (defn get-page [url] 
      (html/html-resource (java.net.URL. url))) 
    
    (defn content->string [content]
      (cond
       (nil? content)    ""
       (string? content) content
       (map? content)    (content->string (:content content))
       (coll? content)   (apply str (map content->string content))
       :else             (str content)))
    
    (derive clojure.lang.PersistentStructMap ::Map)
    (derive clojure.lang.PersistentArrayMap  ::Map)
    (derive java.lang.String                 ::String)
    (derive clojure.lang.ISeq                ::Collection)
    (derive clojure.lang.PersistentList      ::Collection)
    (derive clojure.lang.LazySeq             ::Collection)
    
    (defn tag-type [node]
      (case (:tag node)
        :tr    ::CompoundNode
        :table ::CompoundNode
        :th    ::TerminalNode
        :td    ::TerminalNode
        :h3    ::TerminalNode
        :tbody ::IgnoreNode
        ::IgnoreNode))
    
    (defmulti parse-node
      (fn [node]
        (let [cls (class node)] [cls (if (isa? cls ::Map) (tag-type node) nil)])))
    
    (defmethod parse-node [::Map ::TerminalNode] [node]
      (content->string (:content node)))
    (defmethod parse-node [::Map ::CompoundNode] [node]
      (map parse-node (:content node)))
    (defmethod parse-node [::Map ::IgnoreNode] [node]
      (parse-node (:content node)))
    (defmethod parse-node [::String nil] [node]
      node)
    (defmethod parse-node [::Collection nil] [node]
      (map parse-node node))
    
    (defn h3+table [url]
      (let [ws-content (get-page url)
            h3s+tables (html/select ws-content #{[:div#prospekt_container :h3]
                                                 [:div#prospekt_container :table]})]
        (for [node h3s+tables] (parse-node node))))
    

    A few words on what's going on:

    content->string takes a data structure, collects its content into a single string, and returns that. This lets you apply it to content that may still contain nested subtags (like <br/>) that you want to ignore.
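
    For instance, on a hand-built node (hypothetical data, purely for
    illustration), a nested <br/> contributes nothing to the result:

    (content->string {:tag :td :attrs nil
                      :content ["100" {:tag :br :attrs nil :content nil} "200"]})
    ;=> "100200"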

    The derive statements establish an ad hoc hierarchy that we later use in the multimethod parse-node. This is handy because we never quite know which data structures we're going to encounter, and we can easily add more cases later on.
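
    For example, after the derive calls above, the concrete classes answer
    isa? queries like these:

    (isa? clojure.lang.PersistentArrayMap ::Map)  ;=> true
    (isa? clojure.lang.LazySeq ::Collection)      ;=> true
    (isa? java.lang.String ::Collection)          ;=> false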

    The tag-type function is actually a hack that mimics the derive statements: as far as I know, you can't build a hierarchy out of keywords that aren't namespace-qualified, so I did it like this.
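
    You can see the problem at the REPL: deriving directly from the
    unqualified keywords that Enlive puts under :tag fails.

    (derive :table ::CompoundNode)
    ;; AssertionError: derive accepts only classes or namespace-qualified
    ;; keywords/symbols as the child tag, hence the tag-type lookup above.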

    The multimethod parse-node dispatches on the class of the node and, if the node is a map, additionally on its tag-type.
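
    Concretely, the dispatch function produces two-element vectors like
    the following (assuming Enlive hands us array maps for element nodes),
    and the multimethod matches them against the method signatures with
    isa?, so [clojure.lang.PersistentArrayMap ::TerminalNode] finds the
    [::Map ::TerminalNode] method:

    ;; {:tag :td ...}       => [clojure.lang.PersistentArrayMap ::TerminalNode]
    ;; "some text"          => [java.lang.String nil]
    ;; a lazy seq of nodes  => [clojure.lang.LazySeq nil]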

    Now all we have to do is define the appropriate methods: if we're at a terminal node, we convert its contents to a string; otherwise we either recur on the content or map parse-node over the collection we're dealing with. The method for ::String is actually never used, but I left it in for safety.
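
    Here is a tiny worked example on a hand-built node (hypothetical data,
    just to trace the recursion): the table recurs over its rows, each row
    recurs into its cells, and the cells are terminal nodes that turn into
    strings.

    (parse-node {:tag :table :attrs nil
                 :content [{:tag :tr :attrs nil
                            :content [{:tag :td :attrs nil :content ["a"]}
                                      {:tag :td :attrs nil :content ["b"]}]}]})
    ;=> (("a" "b"))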

    The h3+table function is pretty much what you had before. I simplified the selectors a bit and put them into a set; I'm not sure that putting them into a map, as you did, works as intended.
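
    With all of that in place, a call like the one below should hand you
    plain Clojure data rather than Enlive nodes (I haven't spelled out the
    exact values, which depend on the live page):

    (def parsed (h3+table *url*))
    ;; parsed is a seq whose elements are strings (one per <h3>) and
    ;; nested seqs of strings (one per <table>, row by row).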

    Happy scraping!