
Getting the jpg images from an HTML file


I'm trying to use grep to get the full URLs of the jpg images in an HTML file. One problem is that the file contains very few newlines, so grep returns the path along with a lot of other text I'm not interested in. How can I get just the URLs of the jpg images?


Solution

  • One single sed command (because .* is greedy, it prints only the last src value on each input line)

    sed -n '/<img/s/.*src="\([^"]*\)".*/\1/p' yourfile.html
    

    Or, using ERE (extended regular expressions) to avoid the backslashes in the expression above:

    sed -E -n '/<img/s/.*src="([^"]*)".*/\1/p' yourfile.html
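
Because .* is greedy, the sed form prints only the last src value on each input line, which matters when the file has few newlines, as in the question. The sketch below shows this on a made-up one-line sample (the temp file and its contents are assumptions for illustration), along with one possible workaround: turning every > into a newline with tr so that each tag ends up on its own line. It assumes no attribute value contains a literal > character.

```shell
# Hypothetical sample: two <img> tags on a single line.
sample=$(mktemp)
printf '%s\n' '<p><img src="a.jpg"><img alt="x" src="b.jpg"></p>' > "$sample"

# Greedy .* swallows everything up to the LAST src=" on the line.
last_only=$(sed -n '/<img/s/.*src="\([^"]*\)".*/\1/p' "$sample")

# Workaround: break the line after every tag, then run the same sed.
all_srcs=$(tr '>' '\n' < "$sample" | sed -n '/<img/s/.*src="\([^"]*\)".*/\1/p')

printf 'last only: %s\n' "$last_only"
printf 'all:\n%s\n' "$all_srcs"
rm -f "$sample"
```

The grep -o variants that follow do not have this limitation, since -o prints every match on a line separately.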
    

    One basic grep command (prints each match from <img through the closing quote of the src value, not the bare URL)

    grep -o '<img[^>]*src="[^"]*"' yourfile.html
    

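    Two successive basic grep commands (the second keeps only the last quoted value of each match, which is the src value; the URLs are printed with their surrounding double quotes)

    grep -o '<img[^>]*src="[^"]*"' yourfile.html | grep -o '"[^"]*"$'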
    

    One single grep command using Perl-compatible regular expressions (PCRE; needs a grep that supports -P, such as GNU grep)

    grep -Po '<img[^>]*src="\K[^"]*(?=")' yourfile.html
    

    Using ack as a grep-like replacement (the install line assumes a Debian- or Ubuntu-based system)

    sudo apt install ack
    ack -o '<img[^>]*src="\K[^"]*(?=")' yourfile.html
    

    Downloading a web page and piping it straight into sed, as proposed by s-hunter

    curl -s example.com/a.html | sed -En '/<img/s/.*src="([^"]*)".*/\1/p'
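
None of the commands above restricts the output to jpg files, which is what the question asks for. Here is a minimal sketch of one way to add that filter using only basic grep and sed: extract each src value, then keep the ones ending in .jpg or .jpeg, ignoring case. The sample page and its URLs are made up for illustration, and a URL with a query string after the extension would need a looser final pattern.

```shell
# Hypothetical sample page with three images on one line.
page=$(mktemp)
printf '%s\n' '<div><img alt="one" src="http://example.com/a.jpg"><img src="/img/b.PNG"><img class="x" src="https://example.com/c.JPEG"></div>' > "$page"

# 1) one match per <img> tag, 2) reduce each match to the bare URL,
# 3) keep only URLs ending in .jpg or .jpeg (case-insensitive).
jpg_urls=$(grep -o '<img[^>]*src="[^"]*"' "$page" \
  | sed 's/.*src="\([^"]*\)"$/\1/' \
  | grep -Ei '\.jpe?g$')

printf '%s\n' "$jpg_urls"
rm -f "$page"
```

The same final grep can be appended to the grep -Po or ack commands, since those already print one bare URL per line.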