Tags: shell, wget, title

Get page titles from a list of URLs


I have a list of URLs and I need to get the page titles saved to another list. wget or curl seems to be the right way to go, but I don't know exactly how. Can you help? Thanks


Solution

  • You mean something like this?

    wget_title_from_filelist.sh

    #!/bin/bash
    # Read one URL per line from stdin and print "URL --> page title".
    # -q keeps wget quiet, -O - writes the page to stdout; the tr and sed
    # steps are explained below.
    while read -r URL; do
        echo -n "$URL --> "
        wget -q -O - "$URL" | \
           tr "\n" " " | \
           sed 's|.*<title>\([^<]*\).*</head>.*|\1|;s|^\s*||;s|\s*$||'
        echo
    done
    

    filelist.txt

    https://stackoverflow.com
    https://cnn.com
    https://reddit.com
    https://archive.org
    

    Usage

    ./wget_title_from_filelist.sh < filelist.txt
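
    To save only the titles to a file (the "other list" from your question), strip the "URL --> " prefix with an extra sed and redirect stdout; titles.txt is just an assumed file name:

    ./wget_title_from_filelist.sh < filelist.txt | sed 's|.* --> ||' > titles.txt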
    

    Output

    https://stackoverflow.com --> Stack Overflow - Where Developers Learn, Share, &amp; Build Careers
    https://cnn.com --> CNN International - Breaking News, US News, World News and Video
    https://reddit.com --> reddit: the front page of the internet
    https://archive.org --> Internet Archive: Digital Library of Free &amp; Borrowable Books, Movies, Music &amp; Wayback Machine
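
    Note that HTML entities in the titles (such as &amp; above) are printed as-is. If you want them decoded, one rough option is an extra sed pass for the common ones (only &amp; and &quot; are handled in this sketch):

    ./wget_title_from_filelist.sh < filelist.txt | sed 's|&amp;|\&|g;s|&quot;|"|g'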
    

    Explanation

    tr "\n" " "     # remove \n, create one line of input for sed
    
    sed 's|.*<title>\([^<]*\).*</head>.*|\1|;   # find <title> in <head>
    s|^\s*||;                                   # remove leading spaces
    s|\s*$||'                                   # remove trailing spaces
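
    Since you mention curl as well, here is a minimal sketch of the same loop fetching with curl instead of wget; -s silences the progress output and -L follows redirects. The file name curl_title_from_filelist.sh is just a suggestion:

    curl_title_from_filelist.sh

    #!/bin/bash
    # Same idea as above, but fetching the page with curl
    while read -r URL; do
        echo -n "$URL --> "
        curl -s -L "$URL" | \
           tr "\n" " " | \
           sed 's|.*<title>\([^<]*\).*</head>.*|\1|;s|^\s*||;s|\s*$||'
        echo
    done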