python-3.x beautifulsoup http-status-code-403

BeautifulSoup returning 403 error for some sites


I don't understand why I am getting a 403 error for some of these sites.

If I visit the URLs manually, the pages load fine. There isn't any error message other than the 403 response, so I don't know how to diagnose the problem.

from bs4 import BeautifulSoup
import requests    

test_sites = [
 'http://fashiontoast.com/',
 'http://becauseimaddicted.net/',
 'http://www.lefashion.com/',
 'http://www.seaofshoes.com/',
 ]

for site in test_sites:
    print(site)
    #get page source
    response = requests.get(site)
    print(response)
    #print(response.text)

Result of running the above code is...

http://fashiontoast.com/
<Response [403]>
http://becauseimaddicted.net/
<Response [403]>
http://www.lefashion.com/
<Response [200]>
http://www.seaofshoes.com/
<Response [200]>

Can anyone help me understand the cause of the problem and the solution please?


Solution

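  • Some sites reject GET requests that do not carry a browser-like User-Agent header. By default, requests identifies itself as python-requests/<version>, and some servers answer that with a 403.

    Visit the page in a browser (Chrome), right-click, and choose 'Inspect'. In the Network tab, select the GET request for the page and copy its User-Agent header.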

    from bs4 import BeautifulSoup
    import requests

    test_sites = [
     'http://fashiontoast.com/',
     'http://becauseimaddicted.net/',
     'http://www.lefashion.com/',
     'http://www.seaofshoes.com/',
     ]

    # Keep the requests inside the with block so the session is still open
    with requests.Session() as se:
        # update() overrides these keys and keeps the other session defaults
        se.headers.update({
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36",
            "Accept-Encoding": "gzip, deflate",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
            "Accept-Language": "en"
        })

        for site in test_sites:
            print(site)
            #get page source
            response = se.get(site)
            print(response)
            #print(response.text)
    

    Output:

    http://fashiontoast.com/
    <Response [200]>
    http://becauseimaddicted.net/
    <Response [200]>
    http://www.lefashion.com/
    <Response [200]>
    http://www.seaofshoes.com/
    <Response [200]>
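
To check which sites actually filter on the User-Agent, one option is to try a plain GET first and retry with browser-like headers only when the server answers 403. The following is a minimal sketch, not a definitive implementation: the helper name get_with_fallback, the BROWSER_HEADERS constant and the 10-second timeout are illustrative assumptions, and a 403 can have other causes (IP blocking, cookies, bot protection) that this retry will not fix.

```python
# Hypothetical helper (not part of requests): retry a GET with
# browser-like headers when the first attempt is rejected with 403.

# Assumed header values; copy the real ones from your own browser.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/70.0.3538.110 Safari/537.36",
    "Accept-Language": "en",
}


def get_with_fallback(session, url, fallback_headers=None, timeout=10):
    """Return (response, used_fallback).

    `session` is expected to behave like requests.Session: it needs a
    get(url, headers=..., timeout=...) method that returns an object
    with a status_code attribute.
    """
    if fallback_headers is None:
        fallback_headers = BROWSER_HEADERS

    # First attempt: whatever headers the session sends by default.
    response = session.get(url, timeout=timeout)
    if response.status_code != 403:
        return response, False

    # Second attempt: same URL, this time with browser-like headers.
    response = session.get(url, headers=fallback_headers, timeout=timeout)
    return response, True


# Example usage with requests (performs real network calls):
#
#     import requests
#
#     with requests.Session() as se:
#         for site in ['http://fashiontoast.com/', 'http://www.lefashion.com/']:
#             response, used_fallback = get_with_fallback(se, site)
#             print(site, response, 'fallback used:', used_fallback)
```

If used_fallback comes back True together with a 200, the site was most likely rejecting the default python-requests User-Agent. If the retry also returns 403, the block is probably based on something other than the headers.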