I'm scraping a website with some poorly structured HTML using a Clojure wrapper around jsoup called Reaver. Here is an example of some of the HTML structure:
<div id="article">
<aside>unwanted text</aside>
<p>Some text</p>
<nav><ol><li><h2>unwanted text</h2></li></ol></nav>
<p>More text</p>
<h2>A headline</h2>
<figure><figcaption>unwanted text</figcaption></figure>
<p>More text</p>
Here is a paragraph made of some raw text directly in the div
<p>Another paragraph of text</p>
More raw text and this one has an <a>anchor tag</a> inside
<dl>
<dd>unwanted text</dd>
</dl>
<p>Etc etc</p>
</div>
This div represents an article on a wiki. I want to extract the text from it, but as you can see, some paragraphs are in p tags, and some are contained directly within the div. I also need the headlines and anchor tag text.
I know how to parse and extract the text from all of the p, a, and h tags, and I can select for the div and extract the inner text from it, but the problem is that I end up with two selections of text that I need to merge somehow.
How can I extract the text from this div, so that all of the text from the p, a, and h tags, as well as the inner text on the div, is extracted in order? The result should be paragraphs of text in the same order as they appear in the HTML.
Here is what I am currently using to extract, but the inner div text is missing from the results:
(defn get-texts [url]
  (:paragraphs (extract (parse (slurp url))
                        [:paragraphs]
                        "#article > *:not(aside, nav, table, figure, dl)" text)))
Note also that additional unwanted elements appear in this div, e.g., aside, figure, etc. These elements contain text, as well as nested elements with text, that should not be included in the result.
You could extract the entire article as a JSoup object (likely an Element), then convert it to an EDN representation using reaver/to-edn. Then you go through the :content of that and handle both strings (the result of TextNodes) and elements that have a :tag that interests you.
(Code by vaer-k)
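;; Select the whole #article div and convert it to EDN,
;; so that its text nodes and child elements can be walked in order.
(defn get-article [url]
  (:article (extract (parse (slurp url))
                     [:article]
                     "#article"
                     edn)))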
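;; True for raw strings (text nodes) and for elements whose tag is wanted.
;; Headline tags are not in this set; add e.g. :h2 if headlines are needed.
(defn text-elem?
  [element]
  (or (string? element)
      (contains? #{:p :a :b :i} (:tag element))))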
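;; Concatenate, in document order, the strings and the text of wanted
;; child elements; unwanted elements (aside, nav, ...) are filtered out.
(defn extract-text
  [{content :content}]
  (let [text-children (filter text-elem? content)]
    (reduce #(if (string? %2)
               (str %1 %2)
               (str %1 (extract-text %2)))
            ""
            text-children)))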
(defn extract-article [url]
  (-> url
      get-article
      extract-text))
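
The code above returns one concatenated string. If separate paragraphs are wanted, as the question asks, the same walk over :content can collect a vector instead. The following is only a sketch under assumptions: sample-article is a hand-written stand-in for the EDN form of the example div (elements as maps with :tag and :content, text nodes as plain strings), and block-tags, inline-tags, node-text, and paragraphs are hypothetical names, not part of Reaver. Each block element (p or a headline) becomes its own paragraph, while raw text and inline tags between block elements are merged into one paragraph.

```clojure
(require '[clojure.string :as str])

;; ASSUMPTION: hand-written stand-in for the EDN form of part of the example div.
(def sample-article
  {:tag :div
   :content ["\n"
             {:tag :aside :content ["unwanted text"]}
             "\n"
             {:tag :p :content ["Some text"]}
             "\n"
             {:tag :nav
              :content [{:tag :ol
                         :content [{:tag :li
                                    :content [{:tag :h2 :content ["unwanted text"]}]}]}]}
             {:tag :h2 :content ["A headline"]}
             "\nHere is a paragraph made of some raw text directly in the div\n"
             {:tag :p :content ["Another paragraph of text"]}
             "\nMore raw text and this one has an "
             {:tag :a :content ["anchor tag"]}
             " inside\n"
             {:tag :dl :content [{:tag :dd :content ["unwanted text"]}]}
             "\n"
             {:tag :p :content ["Etc etc"]}]})

;; Tags that form a paragraph of their own, and tags that belong to the
;; surrounding run of text. Both sets are illustrative choices.
(def block-tags #{:p :h1 :h2 :h3 :h4 :h5 :h6})
(def inline-tags #{:a :b :i})
(def wanted-tags (into block-tags inline-tags))

(defn node-text
  "Text of a node, descending only into wanted tags."
  [node]
  (cond
    (string? node) node
    (contains? wanted-tags (:tag node)) (apply str (map node-text (:content node)))
    :else ""))

(defn paragraphs
  "Paragraph strings of an article element, in document order."
  [{:keys [content]}]
  (let [;; Move the pending run of text into :out unless it is blank.
        close-run (fn [{:keys [out run]}]
                    (let [s (str/trim run)]
                      {:out (if (str/blank? s) out (conj out s)) :run ""}))
        step (fn [acc node]
               (cond
                 ;; Block element: close the pending run, then emit the block.
                 (contains? block-tags (:tag node))
                 (-> (close-run acc)
                     (update :run str (node-text node))
                     close-run)
                 ;; Raw text or inline element: extend the pending run.
                 (or (string? node) (contains? inline-tags (:tag node)))
                 (update acc :run str (node-text node))
                 ;; Anything else (aside, nav, dl, ...) is skipped.
                 :else acc))]
    (:out (close-run (reduce step {:out [] :run ""} content)))))
```

With real pages, (paragraphs (get-article url)) would be the equivalent call, provided the EDN that Reaver produces has the shape assumed here. Note that a skipped element does not end the pending run, so raw text on either side of an aside or dl is joined into one paragraph; closing the run there as well is an easy change if that is not wanted.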