
How to get the exact page content in wget if the error code is 404


I have two URLs: one works, and the other points to a deleted page. The working URL is fine, but for the deleted-page URL, instead of getting the exact page content, wget receives a 404 and returns nothing.

Working URL

import os
def curl(url):
    # wget -qO- : quiet mode, write the fetched page body to stdout
    data = os.popen('wget -qO- %s' % url).read()
    print(url)
    print(len(data))
    #print(data)

curl("https://www.reverbnation.com/artist_41/bio")

Output:

https://www.reverbnation.com/artist_41/bio
80067

Deleted-page URL

import os
def curl(url):
    data = os.popen('wget -qO- %s' % url).read()
    print(url)
    print(len(data))
    #print(data)

curl("https://www.reverbnation.com/artist_42/bio")

Output:

https://www.reverbnation.com/artist_42/bio
0

I get a length of 0, but the live page has some content on it.

How can I receive the exact content with wget or curl?
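
For reference, a quick way to confirm what the server actually returns is curl's standard --write-out option. A minimal sketch, assuming curl is installed (the status helper name is just illustrative):

import os
def status(url):
    # -s silences progress, -o /dev/null discards the body,
    # -w '%{http_code}' prints just the HTTP status code
    code = os.popen("curl -s -o /dev/null -w '%%{http_code}' %s" % url).read()
    print(url, code)

status("https://www.reverbnation.com/artist_41/bio")  # expected: 200
status("https://www.reverbnation.com/artist_42/bio")  # expected: 404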


Solution

  • wget has a switch called "--content-on-error". From the manual:

    --content-on-error
               If this is set to on, wget will not skip the content when
               the server responds with a http status code that indicates
               error.

    In other words, it makes wget output the page body even when the server responds with an HTTP status code that indicates an error.

    So just add it to your code and you will have the "content" of the 404 pages too:

    import os
    def curl(url):
        # --content-on-error keeps the body of error responses
        data = os.popen('wget --content-on-error -qO- %s' % url).read()
        print(url)
        print(len(data))
        #print(data)

    curl("https://www.reverbnation.com/artist_42/bio")