
Why am I getting an httplib2.RedirectLimit error?


I have a script that takes a URL and returns the value of the page's <title> tag. After a few hundred or so runs, I always get the same error:

File "/home/edmundspenser/Dropbox/projects/myfiles/titlegrab.py", line 202, in get_title
    status, response = http.request(pageurl)
File "/usr/lib/python2.7/dist-packages/httplib2/__init__.py", line 1390, in _request
    raise RedirectLimit("Redirected more times than rediection_limit allows.", response, content)
httplib2.RedirectLimit: Redirected more times than rediection_limit allows.
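(For context: httplib2 follows redirects itself and raises `RedirectLimit` once a chain exceeds the `redirections` argument of `Http.request`, which defaults to 5, so a server-side redirect loop surfaces exactly like this.) The logic behind the limit can be sketched in plain Python; the URL chain below is invented purely for illustration:

```python
# Toy model of the redirect-following logic: hop along "Location"
# headers until a real page appears or the budget runs out.
# This fake chain contains a deliberate loop between /a and /b.
REDIRECTS = {
    "http://example.test/a": "http://example.test/b",
    "http://example.test/b": "http://example.test/a",  # loop!
}

def follow(url, redirections=5):
    for _ in range(redirections):
        if url not in REDIRECTS:
            return url          # a real (non-3xx) page: done
        url = REDIRECTS[url]    # a redirect: hop to the new URL
    raise RuntimeError("Redirected more times than allowed.")
```

Calling `follow("http://example.test/a")` exhausts the budget and raises, which is the situation the traceback above describes.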

My function looks like:

import httplib2
from BeautifulSoup import BeautifulSoup, SoupStrainer

def get_title(pageurl):
    http = httplib2.Http()
    status, response = http.request(pageurl)
    # Parse only the <title> element of the page
    x = BeautifulSoup(response, parseOnlyThese=SoupStrainer('title'))
    x = str(x)
    y = x[7:-8]          # strip "<title>" and "</title>"
    z = y.split('-')[0]  # keep the part before the first "-"
    return z

Pretty straightforward. I wrapped the call in try/except with a time.sleep(1) to give it a chance to recover, if that was the issue, but so far nothing has worked. And I don't want to just skip the failing URLs. Maybe the website is rate-limiting me?
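For what it's worth, a single time.sleep(1) in the except block only buys one pause. If the failures are transient, a small retry helper with exponential backoff gives each URL several chances before giving up; a sketch (the attempt count and delays are arbitrary, not tuned for wikiart.org):

```python
import time

def retry(func, attempts=3, delay=1.0):
    """Call func(); on failure wait, double the delay, and try again.
    Re-raises the last exception once attempts are exhausted."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(delay)
            delay *= 2  # exponential backoff: 1s, 2s, 4s, ...
```

`title = retry(lambda: get_title(pageurl))` would then replace the bare call.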

Edit: As of right now the script doesn't work at all; it runs into said error on the very first request.

I have a JSON file of over 80,000 URLs of www.wikiart.org painting pages. For each one I run my function to get the title. So:

print repr(get_title('http://www.wikiart.org/en/vincent-van-gogh/van-gogh-s-chair-1889'))

returns

"Van Gogh's Chair"
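As an aside, the x[7:-8] slice is fragile: it breaks the moment the <title> tag carries an attribute or extra whitespace. The standard library's HTML parser (module `html.parser` in Python 3, `HTMLParser` in Python 2) can pull the title text out without slicing; a sketch, splitting on " - " as the page titles seem to use:

```python
from html.parser import HTMLParser  # Python 3 module path

class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

def extract_title(html):
    parser = TitleParser()
    parser.feed(html)
    # Mirror the original intent: keep only the painting's name
    return parser.title.split(" - ")[0].strip()
```

This survives tags like `<title lang="en">` that would throw the fixed-offset slice off.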

Solution

  • Try the Requests library instead. On my end there doesn't seem to be any rate limiting; I was able to retrieve 13 titles in 21.6s. See below:

    Code:

    import requests as rq
    from bs4 import BeautifulSoup as bsoup
    
    def get_title(url):
    
        r = rq.get(url)
        soup = bsoup(r.content)
        # Page titles look like "Painting Title - Artist - WikiArt.org";
        # keep only the part before the first " - ".
        title = soup.find_all("title")[0].get_text()
        print title.split(" - ")[0]
    
    def main():
    
        urls = [
        "http://www.wikiart.org/en/henri-rousseau/tiger-in-a-tropical-storm-surprised-1891",
        "http://www.wikiart.org/en/edgar-degas/the-green-dancer-1879",
        "http://www.wikiart.org/en/claude-monet/dandelions",
        "http://www.wikiart.org/en/albrecht-durer/the-little-owl-1506",
        "http://www.wikiart.org/en/gustav-klimt/farmhouse-with-birch-trees-1903",
        "http://www.wikiart.org/en/jean-michel-basquiat/boxer",
        "http://www.wikiart.org/en/fernand-leger/three-women-1921",
        "http://www.wikiart.org/en/alphonse-mucha/flower-1897",
        "http://www.wikiart.org/en/alphonse-mucha/ruby",
        "http://www.wikiart.org/en/georges-braque/musical-instruments-1908",
        "http://www.wikiart.org/en/rene-magritte/the-evening-gown-1954",
        "http://www.wikiart.org/en/m-c-escher/lizard-1",
        "http://www.wikiart.org/en/johannes-vermeer/the-girl-with-a-pearl-earring"
        ]
    
        for url in urls:
            get_title(url)
    
    if __name__ == "__main__":
        main()
    

    Output:

    Tiger in a Tropical Storm (Surprised!) 
    The Green Dancer
    Dandelions
    The Little Owl
    Farmhouse with Birch Trees
    Boxer
    Three Women
    Flower
    Ruby
    Musical Instruments
    The evening gown
    Lizard
    The Girl with a Pearl Earring
    [Finished in 21.6s]
    

    However, out of respect for the site, I don't recommend hammering it like this: with a fast connection you'll pull data very quickly. Letting the scrape sleep for a few seconds every 20 pages or so won't hurt.
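That pacing advice is easy to fold into the loop itself; a sketch, where the batch size of 20 and the pause length are arbitrary starting points rather than anything wikiart.org specifies:

```python
import time

def polite_iter(items, batch_size=20, pause=3.0):
    """Yield items one by one, sleeping for `pause` seconds
    after every `batch_size` items to throttle the scrape."""
    for i, item in enumerate(items, start=1):
        yield item
        if i % batch_size == 0:
            time.sleep(pause)
```

Then `for url in polite_iter(urls): get_title(url)` throttles itself automatically.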

    EDIT: An even faster version, using grequests, which allows asynchronous requests to be made. This pulls the same data above in 2.6s, nearly 10 times faster. Again, limit your scrape speed out of respect for the site.

    import grequests as grq
    from bs4 import BeautifulSoup as bsoup
    
    def get_title(response):
    
        soup = bsoup(response.content)
        # Same parsing as before: keep the part before the first " - ".
        title = soup.find_all("title")[0].get_text()
        print title.split(" - ")[0]
    
    def main():
    
        urls = [
        "http://www.wikiart.org/en/henri-rousseau/tiger-in-a-tropical-storm-surprised-1891",
        "http://www.wikiart.org/en/edgar-degas/the-green-dancer-1879",
        "http://www.wikiart.org/en/claude-monet/dandelions",
        "http://www.wikiart.org/en/albrecht-durer/the-little-owl-1506",
        "http://www.wikiart.org/en/gustav-klimt/farmhouse-with-birch-trees-1903",
        "http://www.wikiart.org/en/jean-michel-basquiat/boxer",
        "http://www.wikiart.org/en/fernand-leger/three-women-1921",
        "http://www.wikiart.org/en/alphonse-mucha/flower-1897",
        "http://www.wikiart.org/en/alphonse-mucha/ruby",
        "http://www.wikiart.org/en/georges-braque/musical-instruments-1908",
        "http://www.wikiart.org/en/rene-magritte/the-evening-gown-1954",
        "http://www.wikiart.org/en/m-c-escher/lizard-1",
        "http://www.wikiart.org/en/johannes-vermeer/the-girl-with-a-pearl-earring"
        ]
    
        # Build the requests lazily, then fire them concurrently;
        # map() returns responses in the same order as urls.
        rs = (grq.get(u) for u in urls)
        for i in grq.map(rs):
            get_title(i)
    
    if __name__ == "__main__":
        main()
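One way to keep the speed in check: grequests.map accepts a size argument (e.g. grq.map(rs, size=5)) that caps how many requests are in flight at once. If you'd rather stay on the standard library, concurrent.futures gives the same bounded concurrency; a sketch with a caller-supplied fetch function:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_all(urls, fetch, max_workers=5):
    """Run fetch(url) for every URL with at most `max_workers`
    requests in flight, preserving input order in the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))
```

Something like `titles = fetch_all(urls, get_title, max_workers=5)`, where get_title is the requests-based version above, keeps the concurrency bounded.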