Tags: bash, file, curl, wget

Find latest link from an HTML page listing download locations


I'm trying to build an equivalent of the following GitHub-specific code, which finds the latest artifact available for download from https://runtime.fivem.net/artifacts/fivem/build_proot_linux/master. The download links look something like https://runtime.fivem.net/artifacts/fivem/build_proot_linux/master/5901-5db768d8bbb973ba27c81e424aea2910144a3100/fx.tar.xz.

# Working code for github.com; needs to be adapted for fivem.net
LOCATION=$(curl -s https://api.github.com/repos/someuser/somerepo/releases/latest \
  | grep "tag_name" \
  | awk '{print "https://github.com/someuser/somerepo/archive/" substr($2, 2, length($2)-3) ".zip"}')
curl -L -o file.zip "$LOCATION"
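(For reference, the `grep`/`awk` tag extraction above can be done more robustly with `jq`, if it's available. A minimal sketch, where the sample JSON stands in for the live API response:)

```shell
# Sketch: extract tag_name with jq instead of grep/awk.
# The sample payload below stands in for the GitHub API response.
sample='{"tag_name":"v1.2.3","name":"Release 1.2.3"}'
tag=$(printf '%s' "$sample" | jq -r '.tag_name')
LOCATION="https://github.com/someuser/somerepo/archive/${tag}.zip"
echo "$LOCATION"
# curl -L -o file.zip "$LOCATION"   # the actual download step
```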

The path contains an incrementing (but not strictly sequential) build number, followed by a commit hash.

How can I find the latest download link from the HTML page at https://runtime.fivem.net/artifacts/fivem/build_proot_linux/master?


Solution

  • We can build on `lynx -dump`, as suggested in Easiest way to extract the urls from an html page using sed or awk only --

    #!/usr/bin/env bash
    
    url_re='https://runtime\.fivem\.net/artifacts/fivem/build_proot_linux/master/([[:digit:]]+)-([[:xdigit:]]+)/fx\.tar\.xz'
    newest_link_num=0
    newest_link_content=
    while read -r _ link; do
      [[ $link =~ $url_re ]] || continue
      if (( ${BASH_REMATCH[1]} > newest_link_num )); then
        newest_link_num=${BASH_REMATCH[1]}
        newest_link_content=$link
      fi
    done < <(lynx -dump -listonly -hiddenlinks=listonly https://runtime.fivem.net/artifacts/fivem/build_proot_linux/master)
    
    echo "Newest link is: $newest_link_content"
    

    As of this writing, it finishes with the following output:

    Newest link is: https://runtime.fivem.net/artifacts/fivem/build_proot_linux/master/5901-5db768d8bbb973ba27c81e424aea2910144a3100/fx.tar.xz
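  • If you'd rather avoid the lynx dependency, the same selection can be sketched with curl, grep, and sort: pull candidate links out of the raw HTML with `grep -oE` and let a numeric sort on the build-number path component pick the newest. The here-doc below stands in for the live page, which you would fetch with `curl -s https://runtime.fivem.net/artifacts/fivem/build_proot_linux/master`; the hashes shown are made-up placeholders.

    ```shell
    # Sketch: lynx-free variant. The here-doc stands in for the page HTML;
    # replace it with: page=$(curl -s <listing URL>)
    page=$(cat <<'EOF'
    <a href="https://runtime.fivem.net/artifacts/fivem/build_proot_linux/master/5900-aaaa/fx.tar.xz">old</a>
    <a href="https://runtime.fivem.net/artifacts/fivem/build_proot_linux/master/5901-bbbb/fx.tar.xz">new</a>
    EOF
    )
    newest=$(printf '%s\n' "$page" \
      | grep -oE 'https://runtime\.fivem\.net/artifacts/fivem/build_proot_linux/master/[0-9]+-[0-9a-f]+/fx\.tar\.xz' \
      | sort -t/ -k8 -n \
      | tail -n 1)
    echo "Newest link is: $newest"
    ```

    Here `sort -t/ -k8 -n` splits each URL on `/` and sorts numerically on the 8th field, whose leading digits are the build number; `tail -n 1` then keeps the highest one. This trades the robustness of a real HTML parser for fewer dependencies, so it will break if the page layout changes.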