I can't seem to see what is missing. Why is the response not printing the ASINs?
import requests
import re
# Amazon search-result pages to scan for ASINs.
urls = [
    'https://www.amazon.com/s?k=xbox+game&ref=nb_sb_noss_2',
    'https://www.amazon.com/s?k=ps4+game&ref=nb_sb_noss_2',
]

# Without browser-like headers Amazon serves a bot-check/error page that
# contains no /dp/ product links, so the regex matches nothing and each
# print shows an empty set. Sending a real User-Agent fixes that.
headers = requests.utils.default_headers()
headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36',
    'Accept-Language': 'en-US, en;q=0.5',
})

for url in urls:
    decoded_content = requests.get(url, headers=headers).content.decode()
    # Product links look like /<title-slug>/dp/<ASIN>"; capture the ASIN part.
    asins = set(re.findall(r'/[^/]+/dp/([^"]+)', decoded_content))
    print(asins)
Output:
set()
set()
[Finished in 0.735s]
Regular expressions should not be used to parse HTML. Every StackOverflow answer to questions like this recommends against using regex for HTML. It is difficult to write a regular expression complex enough to get the data-asin value from each <div>
. The BeautifulSoup library will make this task easier. But if you must use regex, this code will return everything inside of the body tags:
re.findall(r'<body.*?>(.+?)</body>', decoded_content, flags=re.DOTALL)
Also, print decoded_content
and read the HTML. You might not be receiving the same website that you see in the web browser. Using your code I just get an error message from Amazon or a small test to see if I am a robot. If you do not have real headers attached to your request, big websites like Amazon will not return the page you want. They try to prevent people from scraping their site.
Here is some code that works using the BeautifulSoup library. You need to install the library first: pip3 install bs4
.
from bs4 import BeautifulSoup
import requests
def getAsins(url):
    """Fetch an Amazon search page and return {data-uuid: data-asin} for
    every <div> that carries a data-asin attribute.

    Browser-like headers are attached so Amazon returns the real results
    page instead of a bot-check page.
    """
    request_headers = requests.utils.default_headers()
    request_headers.update({'User-Agent':'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36','Accept-Language': 'en-US, en;q=0.5'})
    response = requests.get(url, headers=request_headers)
    soup = BeautifulSoup(response.content.decode(), 'html.parser')
    # Keep only divs that actually have a data-asin value.
    return {
        div.get('data-uuid'): div.get('data-asin')
        for div in soup.find_all('div')
        if div.get('data-asin')
    }
'''
result = getAsins('https://www.amazon.com/s?k=xbox+game&ref=nb_sb_noss_2')
print(result)
{None: 'B07RBN5C9C', '8652921a-81ee-4e15-b12d-5129c3d35195': 'B07P15JL3T', 'cb25b4bf-efc3-4bc6-ae7f-84f69dcf131b': 'B0886YWLC9', 'bc730e28-2818-472d-bc03-6e9fb97dcaad': 'B089F8R7SQ', '339c4ca0-1d24-4920-be60-54ef6890d542': 'B08GQW447N', '4532f725-f416-4372-8aa0-8751b2b090cc': 'B08DD5559K', 'a0e17b74-7457-4df7-85c9-5eefbfe4025b': 'B08BXHCQKR', '52ef86ef-58ac-492d-ad25-46e7bed0b8b9': 'B087XR383W', '3e79c338-525c-42a4-80da-4f2014ed6cf7': 'B07H5VVV1H', '45007b26-6d8c-4120-9ecc-0116bb5f703f': 'B07DJW4WZC', 'dc061247-2f4c-4f6b-a499-9e2c2e50324b': 'B07YLGXLYQ', '18ff6ba3-37b9-44f8-8f87-23445252ccbd': 'B01FST8A90', '6d9f29a1-9264-40b6-b34e-d4bfa9cb9b37': 'B088MZ4R82', '74569fd0-7938-4375-aade-5191cb84cd47': 'B07SXMV28K', 'd35cb3a0-daea-4c37-89c5-db53837365d4': 'B07DFJJ3FN', 'fc0b73cc-83dd-44d9-b920-d08f07be76eb': 'B07KYC1VL7', 'eaeb69d1-a2f9-4ea4-ac97-1d9a955d706b': 'B076PRWVFG', '0aafbb75-1bac-492c-848e-a046b2de9978': 'B07Q47W1B4', '9e373245-9e8b-4564-a32f-42baa7b51d64': 'B07C4SGGZ2', '4af7587a-98bf-41e0-bde6-2a2fad512d95': 'B07SJ2T3CW', '8635a92e-22a7-4474-a27d-3db75c75e500': 'B08D44W56B', '49d752ce-5d68-4323-be9b-3cbb34c8b562': 'B086JQGB7W', '6398531f-6864-4c7b-9879-84ee9de57d80': 'B07XD3TK36'}
'''
If you are reading html from a file then:
from bs4 import BeautifulSoup
import requests
def getAsins(location_to_file):
    """Parse a saved HTML file and return {data-uuid: data-asin} for every
    <div> that carries a data-asin attribute.

    :param location_to_file: path to an HTML file on disk
    :return: dict mapping each div's data-uuid (may be None) to its data-asin
    """
    # Use a context manager so the file handle is always closed; the
    # original opened the file and never closed it.
    with open(location_to_file) as file:
        soup = BeautifulSoup(file, 'html.parser')
    asins = {}
    for asin in soup.find_all('div'):
        if asin.get('data-asin'):
            asins[asin.get('data-uuid')] = asin.get('data-asin')
    return asins