Could someone explain how to scrape the content of the <td> tags in the row whose <th> has the content "Row1 title" (in this case I actually need the content of the <b> tag for the matching), but without scraping the <th> tag (or any of its content) in the process? Here is my test HTML:
<table class="table_class">
<tbody>
<tr>
<th>
<b>
Row1 title
</b>
</th>
<td>2.660.784</td>
<td>2.944.552</td>
<td>Correct, has 3 td elements</td>
</tr>
<tr>
<th>
Row2 title
</th>
<td>2.660.784</td>
<td>2.944.552</td>
<td>Correct, has 3 td elements</td>
</tr>
</tbody>
</table>
The data I want to extract should come from these tags:
<td>2.660.784</td>
<td>2.944.552</td>
<td>Correct, has 3 td elements</td>
I have managed to write a function which returns the entire content of the table, but I would like to exclude the <th> node from the result and return only the data from the <td> nodes, whose content I can use for further parsing. Can anyone help me with this?
With enlive, something like this
(ns tutorial.so-scrape
  (:require [net.cgrand.enlive-html :as html]))

;; Fetch the page at url and select every td inside a table.
(defn parse-tds [url]
  (html/select (html/html-resource (java.net.URL. url)) [:table :td]))
should give you a sequence of all the td nodes, each of the form {:tag :td :attrs {...} :content (...)}. I am not aware that enlive gives you a way to get the content of those nodes directly, but I could be wrong.
You could then extract the content from that sequence (called ws-content here) with something along the lines of
(for [line ws-content] (apply str (:content line)))
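As far as I can tell, enlive does ship a helper for this: html/text returns the text of a node, including text nested inside child tags. A minimal sketch, assuming the table is already available as a string (sample-html is a made-up, shortened stand-in for the fetched page, and td-texts is a hypothetical name):

```clojure
(ns tutorial.so-scrape-text
  (:require [net.cgrand.enlive-html :as html]
            [clojure.string :as str]))

;; Made-up sample input, shortened from the table in the question.
(def sample-html
  (str "<table class=\"table_class\"><tbody><tr>"
       "<th><b>Row1 title</b></th>"
       "<td>2.660.784</td><td>2.944.552</td>"
       "</tr></tbody></table>"))

;; Hypothetical helper: the trimmed text of every td found in the parsed nodes.
(defn td-texts [nodes]
  (map (comp str/trim html/text)
       (html/select nodes [:table :td])))

;; html-snippet parses a string instead of fetching a URL.
(td-texts (html/html-snippet sample-html))
```

For a real page, the result of html/html-resource can be passed to td-texts in place of the html-snippet call.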
In regard to the question you posted yesterday (I am assuming you are still working with that page): the solution I gave there was a little complex, but it's also flexible. For example, if you change the tag-type function like this
(defn tag-type [node]
  (case (:tag node)
    :td ::TerminalNode
    ::IgnoreNode))
(that is, change the return value for all nodes to ::IgnoreNode, except for :td), then it just gives you a sequence of the content of the :td nodes, which is probably close to what you want. Let me know if you need more help.
EDIT (in reply to comments below)
I don't think selecting nodes based on their :content is possible with enlive alone, but you can certainly do so with Clojure. For example, something like
(for [line ws-content
      :when (re-find (re-pattern "WHAT YOU WANT TO MATCH") (apply str (:content line)))]
  (:content line))
could work. re-find needs a string, so the :content sequence is joined with apply str here; you might have to tweak that form a little if the content holds nested nodes rather than plain strings.
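Putting the pieces together for the original question, one possible sketch: select the tr nodes, keep only the rows whose th text matches, and then read just the td cells of those rows, so the th never ends up in the result. The names row-tds and sample-html are invented here, and the HTML is a shortened copy of the table from the question with the second row's values replaced by placeholders so the two rows can be told apart.

```clojure
(ns tutorial.so-scrape-rows
  (:require [net.cgrand.enlive-html :as html]
            [clojure.string :as str]))

;; Shortened copy of the question's table; the Row2 values are placeholders.
(def sample-html
  (str "<table class=\"table_class\"><tbody>"
       "<tr><th><b> Row1 title </b></th>"
       "<td>2.660.784</td><td>2.944.552</td>"
       "<td>Correct, has 3 td elements</td></tr>"
       "<tr><th>Row2 title</th><td>x</td><td>y</td><td>z</td></tr>"
       "</tbody></table>"))

;; Hypothetical helper: td texts of every row whose th text equals title.
(defn row-tds [nodes title]
  (for [row (html/select nodes [:table.table_class :tr])
        ;; html/text also picks up text nested in the <b> tag.
        :let [heading (str/trim
                       (apply str (map html/text (html/select row [:th]))))]
        :when (= title heading)
        td (html/select row [:td])]
    (str/trim (html/text td))))

(row-tds (html/html-snippet sample-html) "Row1 title")
```

The match is an exact comparison after trimming whitespace; swapping (= title heading) for a re-find call would give the regex matching shown above instead.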