
Quickest way to get list of <title> values from all pages on localhost website


I essentially want to spider my local site and create a list of all the titles and URLs as in:

http://localhost/mySite/Default.aspx      My Home Page
http://localhost/mySite/Preferences.aspx  My Preferences
http://localhost/mySite/Messages.aspx     Messages

I'm running Windows. I'm open to anything that works: a C# console app, PowerShell, some existing tool, etc. We can assume that the <title> tag does exist in the document.

Note: I need to actually spider the files since the title may be set in code rather than markup.


Solution

  • A quick and dirty Cygwin Bash script which does the job:

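    #!/bin/bash
    # Print "<path><TAB><title>" for every .aspx file under $WWWROOT.
    find "$WWWROOT" -iname '*.aspx' -print0 | while IFS= read -r -d '' file; do
      # Join all lines so a <title> split across lines still matches, then extract it.
      # (sed must read stdin here; -i only works on named files.)
      title=$(tr '\n' ' ' < "$file" | sed -n 's/.*<title>\([^<]*\)<\/title>.*/\1/p')
      printf '%s\t%s\n' "$file" "$title"
    done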
    

    Explanation: this finds every .aspx file under the root directory $WWWROOT, replaces all newlines with spaces so that <title> and </title> end up on the same line, and then extracts the text between those tags. Note that it reads the files on disk, so it only sees titles that appear in the markup, not titles set in code.
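
    If the title is set in code, the same extraction can be applied to the rendered page instead of the source file. The following is only a sketch under a few assumptions: curl is available in Cygwin, every .aspx file on disk maps directly to a URL under the site's base address (it does not follow links), and paths contain no characters that need URL encoding. The function names extract_title and list_titles are made up for illustration.

```shell
#!/bin/bash
# Sketch: request each page over HTTP so the server-rendered <title> is captured.
# Assumes each .aspx file under the site root is reachable at the same relative URL.

# Read HTML on stdin and print the text of its <title> element.
extract_title() {
  # Turn CR and LF into spaces so a title split across lines still matches.
  tr '\r\n' '  ' | sed -n 's/.*<title[^>]*>\([^<]*\)<\/title>.*/\1/p'
}

# list_titles ROOT BASE_URL
# ROOT is the site directory on disk (no trailing slash), BASE_URL its address.
# Prints "<url><TAB><title>" for every .aspx file found under ROOT.
list_titles() {
  local root=$1 base_url=$2 file url
  find "$root" -iname '*.aspx' -print0 | while IFS= read -r -d '' file; do
    url="$base_url/${file#"$root"/}"   # strip the disk prefix, keep the relative path
    printf '%s\t%s\n' "$url" "$(curl -s "$url" | extract_title)"
  done
}
```

    Usage, with the URL taken from the question's example: list_titles "$WWWROOT" http://localhost/mySite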