
How to scrape data from specified tag with Enlive?

Could someone explain how to scrape the content of <td> tags in rows where the <th> has the value "Row1 title" (in this case I actually need the content of the <b> tag for the match), without scraping the <th> tag or any of its content in the process? Here is my test HTML:

<table class="table_class"> 
                              Row1 title
                         <td>Correct, has 3 td elements</td> 
                              Row2 title                                
                         <td>Correct, has 3 td elements</td> 

Data which I want to extract should come from these tags:

                     <td>Correct, has 3 td elements</td> 

I have managed to write a function that returns the entire content of the table, but I would like to exclude the <th> node from the result and return only the data from the <td> nodes, whose content I can then use for further parsing. Can anyone help me with this?


  • With Enlive, something like this

    (require '[net.cgrand.enlive-html :as html])

    ;; Fetch the page at url (a string) and select every td inside a table.
    (defn parse-tds [url]
      (html/select (html/html-resource (java.net.URL. url)) [:table :td]))

    should give you a sequence of all the td nodes, each of the form {:tag :td :attrs {...} :content (...)}. I am not aware that Enlive lets you get the content of those nodes directly, but I could be wrong.

    You could then extract the content of each node with something along the lines of the following, where ws-content holds that sequence of nodes:
    (for [line ws-content] (apply str (:content line)))
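
    One caveat with (apply str (:content line)) is that :content can hold nested element maps as well as strings. The sketch below shows one way to flatten a node to text recursively. It is plain Clojure and needs no Enlive; node-text is a hypothetical helper name and sample-tds is hand-written illustrative data standing in for the result of html/select. Enlive also ships html/text, which should serve the same purpose.

    ```clojure
    (require '[clojure.string :as str])

    ;; Hypothetical helper: recursively collect the text of an Enlive-style node.
    ;; Text nodes are plain strings; element nodes are maps with a :content seq.
    (defn node-text [node]
      (cond
        (string? node) node
        (map? node)    (apply str (map node-text (:content node)))
        :else          ""))

    ;; Hand-written stand-in for what (html/select ... [:table :td]) returns.
    ;; The second node is made up to show nested markup.
    (def sample-tds
      [{:tag :td :attrs nil :content ["Correct, has 3 td elements"]}
       {:tag :td :attrs nil
        :content ["Nested " {:tag :b :attrs nil :content ["bold"]} " text"]}])

    (map (comp str/trim node-text) sample-tds)
    ;; => ("Correct, has 3 td elements" "Nested bold text")
    ```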

    Regarding the question you posted yesterday (I am assuming you are still working with that page): the solution I gave there was a little complex, but it's also flexible. For example, if you change the tag-type function like this

    (defn tag-type [node]
      (case (:tag node)
        :td ::TerminalNode
        ::IgnoreNode))

    (that is, change the return value for every node except :td to ::IgnoreNode), then it just gives you a sequence of the content of the :td nodes, which is probably close to what you want. Let me know if you need more help.

    EDIT (in reply to comments below) I don't think selecting nodes based on their :content is possible with enlive alone - but you can certainly do so with Clojure.

    For example, something like

    (for [line ws-content
          :when (re-find (re-pattern "WHAT YOU WANT TO MATCH")
                         (apply str (:content line)))]
      (:content line))

    could work. Note that (:content line) is a sequence, so it is joined into a string before matching; you might have to tweak that form a little if the nodes contain nested tags.
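
    To get only the td data for the row whose th reads "Row1 title", one option is to select the rows first and then filter them in Clojure. This is a minimal sketch, assuming each row looks like <tr><th><b>title</b></th><td>...</td></tr>. The helpers node-text, children-with-tag and tds-for-title are hypothetical names, and sample-rows is hand-written data standing in for the result of (html/select page [:table :tr]).

    ```clojure
    (require '[clojure.string :as str])

    ;; Hypothetical helper: recursively collect the text of an Enlive-style node.
    (defn node-text [node]
      (cond
        (string? node) node
        (map? node)    (apply str (map node-text (:content node)))
        :else          ""))

    ;; Direct element children of node that have the given tag.
    ;; Whitespace strings between tags are skipped by the map? check.
    (defn children-with-tag [node tag]
      (filter #(and (map? %) (= tag (:tag %))) (:content node)))

    ;; For every row whose th text equals title, return the text of its tds.
    ;; The th itself is only used for matching and never appears in the result.
    (defn tds-for-title [rows title]
      (for [row rows
            :when (some #(= title (str/trim (node-text %)))
                        (children-with-tag row :th))
            td (children-with-tag row :td)]
        (str/trim (node-text td))))

    ;; Assumed row structure, written by hand for illustration.
    (def sample-rows
      [{:tag :tr :attrs nil
        :content [{:tag :th :attrs nil
                   :content [{:tag :b :attrs nil :content ["Row1 title"]}]}
                  {:tag :td :attrs nil :content ["Correct, has 3 td elements"]}]}
       {:tag :tr :attrs nil
        :content [{:tag :th :attrs nil
                   :content [{:tag :b :attrs nil :content ["Row2 title"]}]}
                  {:tag :td :attrs nil :content ["Other row"]}]}])

    (tds-for-title sample-rows "Row1 title")
    ;; => ("Correct, has 3 td elements")
    ```

    Matching on the th while returning only the td children keeps the header out of the result, which is what the question asks for.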