Tags: python, web-scraping, python-requests, href, atag

Why can't I get the complete 'href' shown in the browser from noon.com?


Here is what I'm doing:

import requests
from requests.adapters import HTTPAdapter
from bs4 import BeautifulSoup

HEADERS = {
    'authority': 'www.noon.com',
    'pragma': 'no-cache',
    'cache-control': 'no-cache',
    'dnt': '1',
    'upgrade-insecure-requests': '1',
    'accept': '*/*',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36',
    'sec-fetch-site': 'none',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-dest': 'document'
}

response = requests.get(
    'https://www.noon.com/uae-en/electronics-and-mobiles/mobiles-and-accessories/mobiles-20905',
    headers=HEADERS,
    stream=True,
)
soup = BeautifulSoup(response.content, 'lxml')
results = soup.find_all("div", {"class" : "productContainer"})
result = results[0]

print("https://www.noon.com" + result.a.get('href'))

Output

https://www.noon.com/uae-en

But the expected output is 'https://www.noon.com/uae-en/product/N35521717A/p?o=f885efe0b6534e9f'

As you can see here, this is what the browser shows:

<div class="productContainer"><a class="sc-7vj7do-0 ftlAjW" href="/uae-en/product/N35521717A/p?o=f885efe0b6534e9f" id="productBox-N35521717A"><div class="kcs0h5-0 diNcmV grid" title="Samsung Galaxy M31 Dual SIM Blue 6GB RAM 128GB 4G LTE "><div class="e3js0d-1 efqIDW"><div class="productImage" data-qa-id="productImagePLP_Galaxy M31 Dual SIM Blue 6GB RAM 128GB 4G LTE "><div class="lazyload-wrapper"><div class="puv25r-0 hfEfTS"><div class="puv25r-2 hJKuPa"><img alt="Galaxy M31 Dual SIM Blue 6GB RAM 128GB 4G LTE " src="https://a.nooncdn.com/t_desktop-pdp-v1/v1605814225/N35521717A_1.jpg"/></div></div></div></div><div class="e3js0d-2 dqjnoR"><div class="tagContainer"></div></div></div><div class="e3js0d-6 iKEZJh"><div class="e3js0d-7 jULUCI"><div class="e3js0d-10 cyUANN"><span class="e3js0d-11 gXshOX">Samsung</span>Galaxy M31 Dual SIM Blue 6GB RAM 128GB 4G LTE </div></div><div class="e3js0d-8 jtiosv"><div class="sc-3751lm-0 hSumnU"><div class="sc-3751lm-1 eUJkVt large"><span class="currency">AED</span><strong>819.00</strong></div><div class="sc-3751lm-2 kWnsOk"><span class="oldPrice">AED<!-- --> <!-- -->859</span></div></div></div><div class="e3js0d-9 kDpjlW"><div class="e3js0d-12 gMFqig"><div class="u8zs36-0 kRPdZJ"><img alt="noon-express" height="20px" src="https://a.nooncdn.com/s/app/com/noon/images/fulfilment_express-en.png" width="80px"/></div></div></div></div></div></a></div>

Solution

  • What happens and steps to reproduce

    The website appears to generate this content dynamically: the full href is not in the HTML the server sends, so it is presumably filled in by JavaScript after the page loads.

    1. Open the website in a browser.

    2. Open the page source (Ctrl+U) and search for class="productContainer". The href of the <a> only contains /uae-en, and that is what you get with requests.

    3. Open the inspector (Ctrl+Shift+I) and inspect the same <a>. There you will find the dynamically added part, which is what you get if you use Selenium.
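The difference between steps 2 and 3 can be sketched offline. The snippet below is only an illustration: `static_html` and `rendered_html` are shortened stand-ins for the two views of the page (not the real markup), and `FirstProductLink` and `first_product_href` are hypothetical helper names. It uses only the standard library and simply records the first `<a>` it meets after a `productContainer` div opens.

```python
from html.parser import HTMLParser


class FirstProductLink(HTMLParser):
    """Record the href of the first <a> seen after a div.productContainer opens."""

    def __init__(self):
        super().__init__()
        self.in_container = False
        self.href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "div" and "productContainer" in classes:
            self.in_container = True
        elif tag == "a" and self.in_container and self.href is None:
            self.href = attrs.get("href")


def first_product_href(html):
    """Return the first product link's href, or None if there is none."""
    parser = FirstProductLink()
    parser.feed(html)
    return parser.href


# Shortened stand-ins for the two views of the page (assumed, not the real markup).
static_html = '<div class="productContainer"><a href="/uae-en"></a></div>'
rendered_html = (
    '<div class="productContainer">'
    '<a href="/uae-en/product/N35521717A/p?o=f885efe0b6534e9f"></a></div>'
)

print(first_product_href(static_html))    # the view requests gets (page source)
print(first_product_href(rendered_html))  # the view the inspector shows
```

The same selector gives two different results because the input HTML differs, not because the parsing is wrong.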

    Minimal example

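    import time
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.action_chains import ActionChains
    from selenium.webdriver.common.by import By

    # Raw string so the backslashes in the Windows path are not read as escapes.
    # Selenium 4 takes the driver path through a Service object.
    service = Service(r'C:\Program Files\ChromeDriver\chromedriver.exe')
    browser = webdriver.Chrome(service=service)

    browser.get('https://www.noon.com/uae-en/electronics-and-mobiles/mobiles-and-accessories/mobiles-20905')

    # Crude fixed wait so the page scripts can render the product links
    time.sleep(3)
    # find_elements (plural) returns every product link, not just the first;
    # the find_element_by_* helpers were removed in Selenium 4
    elements = browser.find_elements(By.XPATH, "//div[contains(@class, 'productContainer')]/a")

    for element in elements:
        # Move to the element before reading its href, with a fresh action chain each time
        ActionChains(browser).move_to_element(element).perform()
        print(element.get_attribute('href'))

    browser.quit()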
    

    Output

    https://www.noon.com/uae-en/product/N35521717A/p?o=f885efe0b6534e9f
    https://www.noon.com/uae-en/product/N41247213A/p?o=ca38c8921770ea2a
    https://www.noon.com/uae-en/product/N41247235A/p?o=c97b8bfdc0114cba
    https://www.noon.com/uae-en/product/N39790555A/p?o=d7354e20a0bb00ad
    https://www.noon.com/uae-en/product/N32046052A/p?o=faea2e69f38bbf6a
    ...
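The fixed `time.sleep(3)` in the minimal example is a guess at how long rendering takes. A more robust pattern is to poll until the href contains more than the bare locale prefix. Selenium ships `WebDriverWait` for this kind of explicit wait; the sketch below shows the same idea in plain Python so it can be tried without a browser. `wait_for_full_href` and its parameters are hypothetical names, and `get_href` is assumed to be any zero-argument callable, for example `lambda: element.get_attribute('href')`.

```python
import time


def wait_for_full_href(get_href, locale_prefix="/uae-en", timeout=10.0, poll=0.5):
    """Poll get_href() until it returns more than the bare locale prefix.

    get_href is a hypothetical zero-argument callable, for example
    lambda: element.get_attribute('href') for a Selenium element.
    Raises TimeoutError if the href is still incomplete after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        href = get_href() or ""
        # The static markup only carries the locale prefix; anything longer
        # means the product path has been filled in.
        if href and not href.rstrip("/").endswith(locale_prefix):
            return href
        if time.monotonic() >= deadline:
            raise TimeoutError(f"href still incomplete after {timeout}s: {href!r}")
        time.sleep(poll)


# Simulated element whose href is completed on the third read.
reads = iter([
    "https://www.noon.com/uae-en",
    "https://www.noon.com/uae-en",
    "https://www.noon.com/uae-en/product/N35521717A/p?o=f885efe0b6534e9f",
])
print(wait_for_full_href(lambda: next(reads), poll=0))
```

This returns as soon as the link is complete instead of always waiting the full three seconds, and it fails loudly if the link never appears.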
    

    EDIT

    You won't get this information with requests by scraping the page source, but there is an alternative.

    You could call the API with requests and build the link yourself (a simple example you can customize):

    import requests

    # Catalog API endpoint for the same category page
    api_url = "https://www.noon.com/_svc/catalog/api/u/electronics-and-mobiles/mobiles-and-accessories/mobiles-20905"
    headers = {
        "user-agent": "Mozilla/5.0"
    }

    response = requests.get(api_url, headers=headers)
    response.raise_for_status()

    # Each hit is one product record
    records = response.json()["hits"]

    for record in records:
        offer_code = record["offer_code"]
        sku = record["sku"]
        # Named slug so it does not overwrite the request URL above
        slug = record["url"]
        print(f"https://www.noon.com/uae-en/{slug}/{sku}/p?o={offer_code}")
    

    Output

    https://www.noon.com/uae-en/galaxy-m31-dual-sim-blue-6gb-ram-128gb-4g-lte/N35521717A/p?o=f885efe0b6534e9f
    https://www.noon.com/uae-en/iphone-12-pro-max-with-facetime-128gb-pacific-blue-5g-international-specs/N41247213A/p?o=ca38c8921770ea2a
    https://www.noon.com/uae-en/iphone-12-pro-with-facetime-256gb-pacific-blue-5g-international-specs/N41247235A/p?o=cfab59c09cab747b
    ...
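If you want the link-building step to be reusable, it can be pulled into a small function. This is a minimal sketch, assuming each record carries the `url`, `sku` and `offer_code` keys used above; `build_product_url` and `BASE` are hypothetical names, and the sample record is assembled from the first line of the output shown above rather than from a live API response.

```python
from urllib.parse import quote, urlencode

BASE = "https://www.noon.com/uae-en"


def build_product_url(record, base=BASE):
    """Build a product link from one API record, or return None if a field is missing."""
    try:
        slug, sku, offer_code = record["url"], record["sku"], record["offer_code"]
    except KeyError:
        return None
    # quote() escapes characters that are unsafe in a path segment;
    # urlencode() does the same for the query string.
    return f"{base}/{quote(slug)}/{quote(sku)}/p?{urlencode({'o': offer_code})}"


# Sample record assembled from the output above (not a live API response).
sample = {
    "url": "galaxy-m31-dual-sim-blue-6gb-ram-128gb-4g-lte",
    "sku": "N35521717A",
    "offer_code": "f885efe0b6534e9f",
}
print(build_product_url(sample))
```

Returning None for incomplete records lets the caller skip them instead of crashing on a KeyError in the middle of the loop.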