Tags: http, common-lisp, html-parsing

How can I use the Common Lisp libraries dex, plump, and clss to extract the title of a web page?


I am using Emacs, SLIME, and SBCL to develop in Common Lisp on a desktop PC running NixOS.

I am also using the libraries dex, plump, and clss to extract the title of a web page. I ran:

CL-USER> (clss:select "title" (plump:parse  (dex:get "http://www.pdelfino.com.br")))
#(#<PLUMP-DOM:ELEMENT title {1009C488E3}>)

I was expecting: "Pedro Delfino".

Instead, I got the object:

#(#<PLUMP-DOM:ELEMENT title {1009C488E3}>)

Describing the object does not help me find the value I want:

CL-USER> (clss:select "title" (plump:parse  (dex:get "http://www.pdelfino.com.br")))
#(#<PLUMP-DOM:ELEMENT title {100A9888E3}>)
CL-USER> (describe *)
#(#<PLUMP-DOM:ELEMENT title {100A9888E3}>)
  [vector]

Element-type: T
Fill-pointer: 1
Size: 10
Adjustable: yes
Displaced: no
Storage vector: #<(SIMPLE-VECTOR 10) {100A9B65BF}>
; No value
CL-USER> 

Where is the value that I need?

Thanks


Solution

  • You can ask plump for the text inside an HTML node with plump:text. It accepts a single node, not the vector that clss:select returns, so use aref to get the first element.

    (plump:text
     (aref (clss:select "title"
                        (plump:parse (dex:get "http://www.pdelfino.com.br")))
           0))
    

    plump:serialize gives you the node's HTML instead, which is useful for inspecting the results.
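
    When the selector matches nothing, the vector is empty and there is no first node to pass to plump:text. A minimal sketch of a wrapper that handles this case, assuming Quicklisp is available to load the libraries; page-title is a hypothetical helper name, not part of plump or clss:

    ```lisp
    ;; Assumption: Quicklisp is installed. Evaluate this form first
    ;; (at the REPL or via LOAD) so the packages exist before the DEFUN is read.
    (ql:quickload '(:plump :clss) :silent t)

    ;; PAGE-TITLE is a hypothetical helper, not a function from plump or clss.
    (defun page-title (html)
      "Return the text of the first <title> element in the string HTML, or NIL if there is none."
      (let ((matches (clss:select "title" (plump:parse html))))
        ;; LENGTH respects the fill pointer, so it counts the actual matches.
        (when (plusp (length matches))
          (plump:text (aref matches 0)))))
    ```

    It takes the HTML as a string, so it can be called as (page-title (dex:get "http://www.pdelfino.com.br")) or tried offline on a literal such as "<html><head><title>Pedro Delfino</title></head></html>".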

    You can also use CLSS and Plump together through lQuery (https://shinmera.github.io/lquery/). First parse the HTML with initialize, then query it with $, as in (lquery:$ <document> "selector"). Adding (text) or (serialize) as the last argument returns the text or the HTML of each match.

    (defparameter *PDELFINO-PARSED* (lquery:$ (initialize (dex:get "http://www.pdelfino.com.br"))))
    
    CIEL-USER> (lquery:$ *PDELFINO-PARSED* "title")
    #(#<PLUMP-DOM:ELEMENT title {1008645923}>)
    
    CIEL-USER> (lquery:$ *PDELFINO-PARSED* "title" (text))
    #("Pedro Delfino")
    
    CIEL-USER> (aref * 0)
    "Pedro Delfino"
    
    CIEL-USER> (lquery:$ *PDELFINO-PARSED* "title" (serialize))
    #("<title>Pedro Delfino</title>")
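
    The lQuery steps above can also be chained into a single expression. A minimal sketch, assuming Quicklisp is available and assuming that lQuery resolves initialize and text inside $ regardless of the current package, as the REPL session above suggests; page-title-lquery is a hypothetical name:

    ```lisp
    ;; Assumption: Quicklisp is installed. Evaluate this form first
    ;; (at the REPL or via LOAD) so the LQUERY package exists before the DEFUN is read.
    (ql:quickload :lquery :silent t)

    ;; PAGE-TITLE-LQUERY is a hypothetical helper, not part of lQuery.
    (defun page-title-lquery (html)
      "Return the text of the first <title> element in the string HTML, or NIL if there is none."
      ;; Parse, select, and extract the text in one $ chain; the result is a vector of strings.
      (let ((titles (lquery:$ (initialize html) "title" (text))))
        (when (plusp (length titles))
          (aref titles 0))))
    ```

    As in the session above, (page-title-lquery (dex:get "http://www.pdelfino.com.br")) should give the title as a plain string rather than a one-element vector.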