
Use httplib to check if a URL will return a certain page?


I'm going through several hundred bit.ly links to see if they have been used to shorten a link. If a link has not been used, it returns this page.

How can I iterate through a list of links to check which ones do NOT return this page?

I tried the HEAD approach used in this question, but that always returned true, of course.

I looked into the HEAD method and found that it never returns any data:

>>> import httplib
>>> conn = httplib.HTTPConnection("www.python.org")
>>> conn.request("HEAD","/index.html")
>>> res = conn.getresponse()
>>> print res.status, res.reason
200 OK
>>> data = res.read()
>>> print len(data)
0
>>> data == ''
True
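
That empty body is expected: a HEAD response carries headers only, never a payload. A self-contained way to see this without touching the network (Python 3 names, where httplib became http.client; the throwaway local server is just a stand-in):

```python
# A HEAD response is defined to carry headers only, never a body, so an
# empty read() is expected.  Check against a throwaway local server:
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_HEAD(self):
        self.send_response(200)
        self.send_header("Content-Length", "5")  # header arrives, body never does
        self.end_headers()

    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), Handler)          # port 0 = pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()

conn = HTTPConnection("127.0.0.1", server.server_port)
conn.request("HEAD", "/index.html")
res = conn.getresponse()
body = res.read()
print(res.status, res.reason)  # 200 OK
print(len(body))               # 0
server.shutdown()
```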

I'm stumped on this, and any help would be great.


Solution

  • If bit.ly returns a 404 HTTP status code for non-shortened links:

    #!/usr/bin/env python
    from httplib import HTTPConnection
    from urlparse import urlsplit
    
    urls = ["http://bit.ly/NKEIV8", "http://bit.ly/1niCdh9"]
    for url in urls:
        host, path = urlsplit(url)[1:3]
        conn = HTTPConnection(host)
        conn.request("HEAD", path)
        r = conn.getresponse()
        if r.status != 404:
            print("{r.status} {url}".format(**vars()))
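
    The urlsplit slice works because urlsplit returns a 5-tuple (scheme, netloc, path, query, fragment), so [1:3] picks out the host and path. A quick sketch (on Python 3 the module is urllib.parse rather than urlparse):

```python
# urlsplit returns (scheme, netloc, path, query, fragment);
# slicing [1:3] yields (host, path) for the HTTPConnection call.
try:
    from urlparse import urlsplit       # Python 2
except ImportError:
    from urllib.parse import urlsplit   # Python 3

host, path = urlsplit("http://bit.ly/NKEIV8")[1:3]
print(host)  # bit.ly
print(path)  # /NKEIV8
```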
    

    Unrelated: to speed up the check, you could use multiple threads:

    #!/usr/bin/env python
    from httplib import HTTPConnection
    from multiprocessing.dummy import Pool # use threads
    from urlparse import urlsplit
    
    def getstatus(url):
        try:
            host, path = urlsplit(url)[1:3]
            conn = HTTPConnection(host)
            conn.request("HEAD", path)
            r = conn.getresponse()
        except Exception as e:
            return url, None, str(e) # error
        else:
            return url, r.status, None
    
    urls = ["http://bit.ly/NKEIV8", "http://bit.ly/1niCdh9"]
    p = Pool(20) # use 20 concurrent connections
    for url, status, error in p.imap_unordered(getstatus, urls):
        if status != 404:
            print("{status} {url} {error}".format(**vars()))
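
    multiprocessing.dummy.Pool exposes the multiprocessing.Pool API but runs jobs in threads, which suits I/O-bound work like these HEAD requests. A network-free sketch of imap_unordered, with a hypothetical stand-in for getstatus():

```python
# imap_unordered yields results as workers finish, not in input order,
# so slow hosts don't block reporting on fast ones.
from multiprocessing.dummy import Pool  # thread-backed Pool

def getstatus_stub(url):
    # hypothetical stand-in: pretend every URL answered 404
    return url, 404, None

urls = ["http://bit.ly/a", "http://bit.ly/b", "http://bit.ly/c"]
pool = Pool(2)
results = list(pool.imap_unordered(getstatus_stub, urls))
pool.close()
pool.join()
print(sorted(url for url, status, error in results))
```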