
Python Web Crawlers and "getting" html source code


So my brother wanted me to write a web crawler in Python (self-taught), and I know C++, Java, and a bit of HTML. I'm using version 2.7 and reading the Python library docs, but I have a few problems. First, httplib.HTTPConnection and the request concept are new to me, and I don't understand whether it downloads an HTML script, like a cookie, or an instance. If you do both of those, do you get the source for a website page? Also, what are some terms I would need to know to modify the page and return the modified page?

Just for background, I need to download a page and replace any img with ones I have

And it would be nice if you guys could tell me your opinions of Python 2.7 and 3.1.


Solution

  • Use Python 2.7; it has more third-party libraries at the moment. (Edit: see below.)

    I recommend using the stdlib module urllib2; it lets you fetch web resources comfortably. Example:

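    import urllib2

    # urlopen returns a file-like response object for the URL.
    response = urllib2.urlopen("http://google.de")
    # read() returns the page's HTML source as a string.
    page_source = response.read()
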

    For parsing the code, have a look at BeautifulSoup.

    BTW, what exactly do you want to do when you say this:

    Just for background, I need to download a page and replace any img with ones I have

    Edit: It's 2014 now. Most of the important libraries have been ported, and you should definitely use Python 3 if you can. python-requests is a very nice high-level library that is easier to use than urllib2.
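
For anyone following the Python 3 advice, here is a rough sketch of the whole task: download a page, then swap every img source for your own. It is only one way to do it, under a few assumptions. It uses urllib.request (the Python 3 home of what urllib2 did) and the stdlib html.parser module in place of BeautifulSoup or python-requests, purely so that it runs without installing anything. The names fetch_html, ImgSrcRewriter, and replace_img_sources are made up for this example, and the rewriter only handles the common markup cases.

```python
from html import escape
from html.parser import HTMLParser
from urllib.request import urlopen


def fetch_html(url):
    """Download a page and return its source as text."""
    with urlopen(url) as response:
        # Assumption: fall back to UTF-8 when no charset is declared.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class ImgSrcRewriter(HTMLParser):
    """Copy HTML through as parsed, but give every <img> a new src."""

    def __init__(self, new_src):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.new_src = new_src
        self.out = []

    def _emit_tag(self, tag, attrs, closer):
        if tag != "img":
            # Not an image: reproduce the tag text exactly as it appeared.
            self.out.append(self.get_starttag_text())
            return
        parts = [tag]
        for name, value in attrs:
            if name == "src":
                continue  # the old source is dropped, the new one is added below
            parts.append(name if value is None else '%s="%s"' % (name, escape(value)))
        parts.append('src="%s"' % escape(self.new_src))
        self.out.append("<" + " ".join(parts) + closer)

    def handle_starttag(self, tag, attrs):
        self._emit_tag(tag, attrs, ">")

    def handle_startendtag(self, tag, attrs):
        self._emit_tag(tag, attrs, " />")

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)

    def handle_pi(self, data):
        self.out.append("<?%s>" % data)


def replace_img_sources(html, new_src):
    """Return html with the src of every <img> replaced by new_src."""
    parser = ImgSrcRewriter(new_src)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

You would then call something like replace_img_sources(fetch_html("http://google.de"), "my_image.png"), where my_image.png is a placeholder for whatever image you want to show. For messy real-world pages, a proper parser such as BeautifulSoup is likely to be more forgiving than this sketch.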