Tags: xml, curl, xml-parsing, wget, xmllint

Get the resources under a specific div (wget, xmllint, etc)


I have already managed to extract the section of the page I wanted, but without its resources (the audio).

wget -q -O - "https://dictionary.cambridge.org/dictionary/english/admirable" | xmllint --html --xpath '//div[@class = "pos-header dpos-h"]' - 2>/dev/null > admirable-wget

This is the section of the website I mean (screenshot omitted).

How can I get the audio as a path or something similar? I would like to play it with mpv later in the script I'm building.


Solution

  • Get the path to the media file with this XPath expression:

    string(//amp-audio[@id="ampaudio1"]/source[@type="audio/ogg"]/@src)
    

    Full command

    wget -q -O - "https://dictionary.cambridge.org/dictionary/english/admirable" | xmllint --recover --html --xpath 'string(//amp-audio[@id="ampaudio1"]/source[@type="audio/ogg"]/@src)' - 2>/dev/null
    

    Result

    /media/english/uk_pron_ogg/u/uka/ukadj/ukadjus011.ogg
    

    The result is relative to the site root, so prepend the host to download it

    wget -q "https://dictionary.cambridge.org/media/english/uk_pron_ogg/u/uka/ukadj/ukadjus011.ogg"
    

    Note: check the site's terms of use before downloading its media.
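Since the goal is to play the clip with mpv from a script, the steps above could be combined roughly as follows. This is only a sketch: the function names (`audio_path`, `audio_url`, `play_word`) and the `BASE` variable are made up for illustration, and it assumes the page keeps the `ampaudio1` id and that mpv can open the HTTPS URL directly.

```shell
#!/bin/sh
# Sketch only: helper names and BASE are illustrative, not part of the original answer.

BASE="https://dictionary.cambridge.org"

# Read HTML on stdin and print the site-relative path of the ogg source.
# The trailing "-" tells xmllint to read from stdin; parser warnings are discarded.
audio_path() {
    xmllint --recover --html \
        --xpath 'string(//amp-audio[@id="ampaudio1"]/source[@type="audio/ogg"]/@src)' \
        - 2>/dev/null
}

# Turn a site-relative path into an absolute URL; fail if the path is empty.
audio_url() {
    [ -n "$1" ] || return 1
    printf '%s%s\n' "$BASE" "$1"
}

# Fetch the entry for a word, extract the audio URL, and hand it to mpv.
play_word() {
    rel=$(wget -q -O - "$BASE/dictionary/english/$1" | audio_path)
    url=$(audio_url "$rel") || { echo "no audio found for $1" >&2; return 1; }
    mpv --no-video "$url"
}
```

After sourcing the file, `play_word admirable` would fetch the page and play the clip. If a local copy is preferred, the `mpv` line could be replaced with `wget -q "$url"` followed by `mpv` on the downloaded file.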