html, clojure, jsoup

How can I extract text from an HTML element containing a mix of `p` tags and inner text?


I'm scraping a website with some poorly structured HTML using a Clojure wrapper around jsoup called Reaver. Here is an example of some of the HTML structure:

<div id="article">
  <aside>unwanted text</aside>
  <p>Some text</p>
  <nav><ol><li><h2>unwanted text</h2></li></ol></nav>
  <p>More text</p>
  <h2>A headline</h2>
  <figure><figcaption>unwanted text</figcaption></figure>
  <p>More text</p>
  Here is a paragraph made of some raw text directly in the div
  <p>Another paragraph of text</p>
  More raw text and this one has an <a>anchor tag</a> inside
  <dl>
    <dd>unwanted text</dd>
  </dl>
  <p>Etc etc</p>
</div>

This div represents an article on a wiki. I want to extract the text from it, but as you can see, some paragraphs are in p tags, and some are contained directly within the div. I also need the headlines and anchor tag text.

I know how to parse and extract the text from all of the p, a, and h tags, and I can select the div and extract its inner text, but then I end up with two separate selections of text that I would somehow need to merge.

How can I extract the text from this div so that the text from the p, a, and h tags, as well as the text directly inside the div, is extracted in order? The result should be paragraphs of text in the same order as they appear in the HTML.

Here is what I am currently using to extract, but the inner div text is missing from the results:

(defn get-texts [url]
  (:paragraphs (extract (parse (slurp url))
                        [:paragraphs]
                        "#article > *:not(aside, nav, table, figure, dl)" text)))

Note also that additional unwanted elements appear in this div, e.g., aside, figure, etc. These elements contain text, as well as nested elements with text, that should not be included in the result.


Solution

  • You could extract the entire article as a jsoup object (likely an Element), then convert it to an EDN representation using reaver/to-edn. Then walk the :content of that representation, handling both strings (which come from TextNodes) and elements whose :tag interests you.

    (Code by vaer-k)

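    ;; Assumes parse, extract, and edn are referred from the reaver namespace.

    (defn get-article
      "Fetches url and returns the #article element as EDN data."
      [url]
      (:article (extract (parse (slurp url))
                         [:article]
                         "#article"
                         edn)))

    (defn text-elem?
      "True for text nodes (strings) and for elements whose tag we keep.
      Heading tags such as :h2 are not in the set; add them to keep headlines."
      [element]
      (or (string? element)
          (contains? #{:p :a :b :i} (:tag element))))

    (defn extract-text
      "Concatenates, in document order, the text of the wanted children,
      recursing into kept elements. No separator is inserted between pieces."
      [{content :content}]
      (let [text-children (filter text-elem? content)]
        (reduce #(if (string? %2)
                   (str %1 %2)
                   (str %1 (extract-text %2)))
                ""
                text-children)))

    (defn extract-article [url]
      (-> url
          get-article
          extract-text))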
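
The code above returns one concatenated string. If separate paragraphs are wanted, as the question asks, a variation along the following lines might work. It is only a sketch, and it assumes the EDN has the shape described above: text nodes are plain strings and elements are maps with :tag and :content. The names block-tags, inline-tags, node-text, emit, paragraphs, and sample are hypothetical and not part of Reaver, and the tag sets are guesses that would need adjusting to the real markup.

```clojure
(require '[clojure.string :as str])

;; Hypothetical tag sets for illustration; adjust them to the real markup.
(def block-tags #{:p :h1 :h2 :h3 :h4 :h5 :h6})
(def inline-tags #{:a :b :i :em :strong})

(defn node-text
  "All text beneath a node, where a node is a string or an element map."
  [node]
  (if (string? node)
    node
    (apply str (map node-text (:content node)))))

(defn emit
  "Appends buf to acc as one paragraph, collapsing whitespace; skips blanks."
  [acc buf]
  (let [s (str/trim (str/replace buf #"\s+" " "))]
    (if (str/blank? s) acc (conj acc s))))

(defn paragraphs
  "Returns a vector of paragraph strings in document order."
  [{:keys [content]}]
  (loop [nodes content, acc [], buf ""]
    (if-let [node (first nodes)]
      (cond
        ;; Raw text and inline elements extend the paragraph being built.
        (or (string? node) (inline-tags (:tag node)))
        (recur (rest nodes) acc (str buf (node-text node)))

        ;; A block element closes any pending raw-text paragraph,
        ;; then contributes its own text as a paragraph.
        (block-tags (:tag node))
        (recur (rest nodes) (emit (emit acc buf) (node-text node)) "")

        ;; Anything else (aside, nav, figure, dl, ...) is dropped,
        ;; but still ends the pending paragraph.
        :else
        (recur (rest nodes) (emit acc buf) ""))
      (emit acc buf))))

;; Hand-written sample data mimicking part of the question's HTML.
(def sample
  {:tag :div
   :content ["\n  "
             {:tag :aside :content ["unwanted text"]}
             {:tag :p :content ["Some text"]}
             {:tag :h2 :content ["A headline"]}
             "\n  Raw text with an "
             {:tag :a :content ["anchor tag"]}
             " inside\n  "
             {:tag :dl :content [{:tag :dd :content ["unwanted text"]}]}
             {:tag :p :content ["Etc etc"]}]})

(paragraphs sample)
;; => ["Some text" "A headline" "Raw text with an anchor tag inside" "Etc etc"]
```

With the functions from the answer, this would be called as (paragraphs (get-article url)). Unlike extract-text, node-text keeps all text nested inside a kept element, so unwanted markup nested within a p tag would need extra filtering.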