Parsing an RSS feed in Clojure with XPath


I am trying to parse this bit of RSS:

<rss xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
    <channel>
        <title>
            Signal RSS - full
        </title>
        <link>
            https://www.mystery.com
        </link>
        <description>
            null
        </description>
        <pubDate>
            Wed, 09 Mar 2022 14:07:31 GMT
        </pubDate>
        <lastBuildDate>
            Wed, 09 Mar 2022 14:07:31 GMT
        </lastBuildDate>
        <item>
            <guid isPermaLink="false">
                someid
            </guid>
            <description>
                -- other text
            </description>
            <text>
                BC-AT&amp;T-Discovery-Start-Mega-Bond-Sale-in-Test-of-Uneasy-Market
            </text>
            <content medium="document" expression="custom" type="text/vnd.IPTC.NewsML" lang="EN" url="https://api.com/syndication/newsml/v12/news/R8FRGG3/a715dac7-5282-4422-be8e" />
        </item>
    </channel>
</rss>

Fairly standard, right?

Using the example from https://kyleburton.github.io/clj-xpath/site/, I modified it into this:

(ns clj-xpath-examples.core
  (:require
   [clojure.string :as string]
   [clojure.pprint :as pp])
  (:use
   clj-xpath.core))

;; Clojure strings need double quotes; single quotes would be read as a quoted symbol
(def input (slurp ".pathToXml.xml"))

(xml->doc input)

This gives me an error I cannot understand:

; IllegalAccessException class clojure.lang.Reflector cannot access class com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl (in module java.xml) because module java.xml does not export com.sun.org.apache.xerces.internal.jaxp to unnamed module @689eb690  jdk.internal.reflect.Reflection.newIllegalAccessException (Reflection.java:392)

Where am I going wrong? If I can use XPath for this, it would make my solution much neater.


Solution

  • Here is one way to do it:

    (ns tst.demo.core
      (:use tupelo.core tupelo.test)
      (:require
        [clojure.walk :as walk]
        [tupelo.forest :as forest]
        [tupelo.parse.xml :as xml]
        [tupelo.string :as str]
        ))
    
    (def xml-str
      (str/quotes->double "
          <rss xmlns:media='http://search.yahoo.com/mrss/' version='2.0'>
            <channel>
                <title>
                    Signal RSS - full
                </title>
                <link>
                    https://www.mystery.com
                </link>
                <description>
                    null
                </description>
                <pubDate>
                    Wed, 09 Mar 2022 14:07:31 GMT
                </pubDate>
                <lastBuildDate>
                    Wed, 09 Mar 2022 14:07:31 GMT
                </lastBuildDate>
                <item>
                    <guid isPermaLink='false'>
                        someid
                    </guid>
                    <description>
                        -- other text
                    </description>
                    <text>
                        BC-AT&amp;T-Discovery-Start-Mega-Bond-Sale-in-Test-of-Uneasy-Market
                    </text>
                    <content medium='document' expression='custom' type='text/vnd.IPTC.NewsML' lang='EN' url='https://api.com/syndication/newsml/v12/news/R8FRGG3/a715dac7-5282-4422-be8e' />
                </item>
            </channel>
        </rss> "))
    

    With a unit test:

    (dotest
      (let [enlive-raw  (xml/parse xml-str)
            enlive-nice (walk/postwalk (fn [item]
                                         (if (string? item)
                                           (str/trim item)
                                           item))
                          enlive-raw)]
        (is= enlive-nice
          {:attrs   {:version "2.0" :xmlns:media "http://search.yahoo.com/mrss/"}
           :content [{:attrs   {}
                      :content [{:attrs {} :content ["Signal RSS - full"] :tag :title}
                                {:attrs {} :content ["https://www.mystery.com"] :tag :link}
                                {:attrs {} :content ["null"] :tag :description}
                                {:attrs   {}
                                 :content ["Wed, 09 Mar 2022 14:07:31 GMT"]
                                 :tag     :pubDate}
                                {:attrs   {}
                                 :content ["Wed, 09 Mar 2022 14:07:31 GMT"]
                                 :tag     :lastBuildDate}
                                {:attrs   {}
                                 :content [{:attrs {:isPermaLink "false"} :content ["someid"] :tag :guid}
                                           {:attrs {} :content ["-- other text"] :tag :description}
                                           {:attrs   {}
                                            :content ["BC-AT&T-Discovery-Start-Mega-Bond-Sale-in-Test-of-Uneasy-Market"]
                                            :tag     :text}
                                           {:attrs   {:expression "custom"
                                                      :lang       "EN"
                                                      :medium     "document"
                                                      :type       "text/vnd.IPTC.NewsML"
                                                      :url        "https://api.com/syndication/newsml/v12/news/R8FRGG3/a715dac7-5282-4422-be8e"}
                                            :content []
                                            :tag     :content}]
                                 :tag     :item}]
                      :tag     :channel}]
           :tag :rss})))
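
Once the feed is in this nested-map form, individual values can be pulled out with core functions alone. The following is a minimal sketch, assuming the `:tag`/`:attrs`/`:content` map shape shown in the test above. The `find-tags` helper and the trimmed-down `parsed` map are hypothetical names used for illustration only; they are not part of Tupelo.

```clojure
;; A trimmed-down copy of the parsed structure (assumed shape: :tag, :attrs, :content).
(def parsed
  {:tag     :rss
   :attrs   {:version "2.0"}
   :content [{:tag     :channel
              :attrs   {}
              :content [{:tag :title :attrs {} :content ["Signal RSS - full"]}
                        {:tag     :item
                         :attrs   {}
                         :content [{:tag :guid :attrs {:isPermaLink "false"} :content ["someid"]}
                                   {:tag     :content
                                    :attrs   {:url "https://api.com/syndication/newsml/v12/news/R8FRGG3/a715dac7-5282-4422-be8e"}
                                    :content []}]}]}]})

(defn find-tags
  "Returns every element map in `tree` whose :tag equals `tag` (hypothetical helper)."
  [tree tag]
  (->> (tree-seq map? :content tree)            ; depth-first walk over elements and text
       (filter #(and (map? %) (= tag (:tag %)))))) ; keep matching elements, skip text nodes

;; Usage: the text of <guid> and the url attribute of <content>.
(-> (find-tags parsed :guid) first :content first)   ;; => "someid"
(-> (find-tags parsed :content) first :attrs :url)
```

This is roughly what an XPath query such as `//guid` does: it matches on the tag name anywhere in the tree.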
    

    Build using my favorite template project.


    P.S. You may also be interested in the Tupelo Forest library:

    (forest/enlive->hiccup enlive-nice) => 
    [:rss
     {:version "2.0", :xmlns:media "http://search.yahoo.com/mrss/"}
     [:channel
      [:title "Signal RSS - full"]
      [:link "https://www.mystery.com"]
      [:description "null"]
      [:pubDate "Wed, 09 Mar 2022 14:07:31 GMT"]
      [:lastBuildDate "Wed, 09 Mar 2022 14:07:31 GMT"]
      [:item
       [:guid {:isPermaLink "false"} "someid"]
       [:description "-- other text"]
       [:text
        "BC-AT&T-Discovery-Start-Mega-Bond-Sale-in-Test-of-Uneasy-Market"]
       [:content
        {:expression "custom",
         :lang "EN",
         :medium "document",
         :type "text/vnd.IPTC.NewsML",
         :url
         "https://api.com/syndication/newsml/v12/news/R8FRGG3/a715dac7-5282-4422-be8e"}]]]]
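
If you would still prefer real XPath expressions, one option is to call the JDK's own `javax.xml.xpath` API through Java interop. The exception in the question appears to come from a reflective call into a JDK-internal class, so the sketch below type-hints every interop call to avoid reflection. Whether that alone resolves the error in your setup is an assumption on my part. `parse-xml`, `xpath-strings` and `sample` are hypothetical names, and the XML is a shortened version of the feed above.

```clojure
(import '(javax.xml.parsers DocumentBuilderFactory DocumentBuilder)
        '(javax.xml.xpath XPathFactory XPath XPathConstants)
        '(org.w3c.dom Document NodeList Node)
        '(org.xml.sax InputSource)
        '(java.io StringReader))

(defn parse-xml
  "Parses an XML string into a DOM Document. The type hints keep the calls non-reflective."
  ^Document [^String s]
  (let [^DocumentBuilderFactory factory (DocumentBuilderFactory/newInstance)
        ^DocumentBuilder builder        (.newDocumentBuilder factory)]
    (.parse builder (InputSource. (StringReader. s)))))

(defn xpath-strings
  "Evaluates XPath `expr` against `doc` and returns the text of each matching node."
  [^Document doc ^String expr]
  (let [^XPath xp       (.newXPath (XPathFactory/newInstance))
        ^NodeList nodes (.evaluate xp expr doc XPathConstants/NODESET)]
    (mapv #(.getTextContent ^Node (.item nodes (int %)))
          (range (.getLength nodes)))))

;; Shortened sample feed; single-quoted attributes avoid escaping inside the string.
(def sample
  (str "<rss version='2.0'><channel>"
       "<title>Signal RSS - full</title>"
       "<item><guid isPermaLink='false'>someid</guid>"
       "<content medium='document' url='https://api.com/syndication/newsml/v12/news/R8FRGG3/a715dac7-5282-4422-be8e'/>"
       "</item></channel></rss>"))

;; Usage: element text and an attribute value.
(xpath-strings (parse-xml sample) "//item/guid")          ;; => ["someid"]
(xpath-strings (parse-xml sample) "//item/content/@url")
```

With the original, indented feed, the element text would include the surrounding whitespace, so you may want to trim each result.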
    
