python scraper

Read all pages within a domain


I am using the urllib library to fetch pages. Typically I have a site's domain name and want to extract some information from every page within that domain. For example, if I have xyz.com, I'd like my code to also fetch the data from xyz.com/about and so on. Here's what I am using:

import urllib,re

htmlFile = urllib.urlopen("http://www.xyz.com/"+r"(.*)")
html = htmlFile.read()
...............

This does not do the trick for me, though. Any ideas are appreciated.

Thanks. -T


Solution

  • In addition to @zigdon's answer, I recommend taking a look at the Scrapy framework.

    Its CrawlSpider class makes it quite easy to implement crawling.
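
To see what a crawler like CrawlSpider automates, here is a minimal standard-library sketch. A URL cannot contain a wildcard such as (.*), so the pages have to be discovered by fetching a start page, extracting its links, and following the ones that stay on the same host. This assumes Python 3 (urllib.request rather than the Python 2 urllib.urlopen used in the question); the names LinkParser, fetch, crawl, and max_pages are made up for illustration and are not part of any library.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Download a page and return its HTML as text (illustrative helper)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch_page=fetch, max_pages=100):
    """Breadth-first crawl of pages on the same host as start_url.

    Returns a dict mapping each visited URL to its HTML.
    max_pages is an assumed safety limit so the crawl always terminates.
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch_page(url)
        except Exception:
            continue  # skip pages that fail to load
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" parts.
            link, _fragment = urldefrag(urljoin(url, href))
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Calling crawl("http://www.xyz.com/") would return the HTML of each reachable page on that host, and the regular expressions from the question could then be applied to each value. Only pages that are linked from somewhere on the site can be found this way, and a real crawler should also respect robots.txt and rate limits, which Scrapy can handle for you.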