
Using JSoup to parse a String with Clojure


I'm using JSoup to parse an HTML string with Clojure. The source is as follows:

Dependencies

:dependencies [[org.clojure/clojure "1.10.1"]
               [org.jsoup/jsoup "1.13.1"]]

Source code

(require '[clojure.string :as str])
(import 'org.jsoup.Jsoup)
(def HTML (str "<html><head><title>Website title</title></head>
                <body><p>Sample paragraph number 1 </p>
                      <p>Sample paragraph number 2</p>
                </body></html>"))

(defn fetch_html [html]
  (let [soup (Jsoup/parse html)
        titles (.title soup)
        paragraphs (.getElementsByTag soup "p")]
    {:title titles :paragraph paragraphs}))

(fetch_html HTML)

Expected result

{:title "Website title", 
 :paragraph ["Sample paragraph number 1" 
             "Sample paragraph number 2"]}

Unfortunately, the result is not what I expected:

user ==> (fetch_html HTML)
{:title "Website title", :paragraph []}

Solution

  • (.getElementsByTag ...) returns an Elements collection, that is, a list of Element objects rather than their text. You need to call the .text method on each element to get its text value. I'm using Jsoup version 1.13.1.

    
    (ns core
      (:import (org.jsoup Jsoup))
      (:require [clojure.string :as str]))
    
    (def HTML (str "<html><head><title>Website title</title></head>
                    <body><p>Sample paragraph number 1 </p>
                          <p>Sample paragraph number 2</p>
                    </body></html>"))
    
    (defn fetch_html [html]
      (let [soup (Jsoup/parse html)
            titles (.title soup)
            paragraphs (.getElementsByTag soup "p")]
        {:title titles :paragraph (mapv #(.text %) paragraphs)}))
    
    (fetch_html HTML)
    
    

    Also consider using Reaver, a Clojure library that wraps JSoup, or one of the other wrappers that others have suggested.
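
    As a variation on the solution above, the same extraction could be sketched with a CSS selector and Elements.eachText, which collects the text of every matched element in one call. This is only a minimal sketch, assuming Jsoup 1.13.1 as in the dependencies above; sample-html and page-summary are hypothetical names used for illustration, not part of the original post.

```clojure
(import 'org.jsoup.Jsoup)

;; Same sample document as in the question (name is hypothetical).
(def sample-html
  "<html><head><title>Website title</title></head>
   <body><p>Sample paragraph number 1 </p>
         <p>Sample paragraph number 2</p>
   </body></html>")

(defn page-summary
  "Hypothetical helper: returns the page title and the text of every <p>."
  [html]
  (let [doc (Jsoup/parse ^String html)]
    {:title     (.title doc)
     ;; .select takes a CSS selector and returns an Elements collection;
     ;; .eachText returns a java.util.List of the text of each element
     ;; that has text, so vec turns it into a Clojure vector.
     :paragraph (vec (.eachText (.select doc "p")))}))

(page-summary sample-html)
```

    With the sample HTML this should produce the same map as the expected result in the question, because Jsoup normalizes and trims whitespace when it extracts element text. Note that eachText skips elements with no text, so an empty <p> would not show up as an empty string.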