Tags: html, selenium-webdriver, web-scraping, scrapy, extract

Extract hidden links from a web page


Please check this link: https://maroof.sa/businesses.

It is a link to a website from which I want to extract links.

For example, if you scroll down you will find a store named "Marwa store"; if you click on its card, it redirects you to the store's page.

I need to scrape all the store links on the page https://maroof.sa/businesses.

After inspecting the page, I found that the links are hidden.

I have successfully extracted the store names, but I can't find the links.

Thanks in advance.

import time
from selenium.webdriver.support.select import Select
from selenium.webdriver.common.by import By
from selenium import webdriver
from scrapy import Selector
import csv

driver = webdriver.Chrome()
driver.get(url="https://maroof.sa/businesses")
html = driver.page_source
# the store cards are found, but they contain no visible href to follow
names = driver.find_elements(By.CSS_SELECTOR, 'div.storeCard')

Solution

  • It's impossible to get the business link from the card markup itself; however, it can be built from the data returned by the request whose URL contains business/search.

    The business link can be built with the pattern {url}/details/{id}, where id is taken from the items array of the response JSON.

    You can capture that response using the Chrome DevTools Protocol (CDP), which is now available in Selenium; a more robust polling variant is sketched after the code below.

    The site also has an anti-scraping mechanism and doesn't load every time for me, so you may need a proxy, Undetected Selenium, or similar. I added some stealth Chrome options, but they don't always get past the bot detection (the site flags me as a bot even in a regular browser, so I think their detection is broken).

    import json
    import time
    
    from selenium import webdriver
    
    # enable Chrome performance logging so network events can be read with get_log("performance")
    options = webdriver.ChromeOptions()
    options.set_capability('goog:loggingPrefs', {'performance': 'ALL'})
    
    def enable_stealth():
        # flags that reduce the chance of automated-browser detection
        options.add_argument("--no-sandbox")
        options.add_argument("--disable-gpu")
        options.add_argument('--disable-blink-features=AutomationControlled')
        options.add_argument('--disable-dev-shm-usage')
        options.add_experimental_option("useAutomationExtension", False)
        options.add_argument("--enable-javascript")
        options.add_argument("--enable-cookies")
        options.add_argument('--disable-web-security')
        options.add_experimental_option("excludeSwitches", ["enable-automation"])
    
    enable_stealth()
    driver = webdriver.Chrome(options=options)
    url = "https://maroof.sa/businesses"
    driver.get(url)
    # give the page time to fire the business/search XHR before reading the performance log
    time.sleep(5)
    logs = driver.get_log("performance")
    # fragment that identifies the XHR carrying the store data
    target_url = 'business/search'
    
    def get_links():
        # scan the captured network events for the business/search response
        for log in logs:
            message = log["message"]
            if "Network.responseReceived" not in message:
                continue
            params = json.loads(message)["message"].get("params")
            if params is None:
                continue
            response = params.get("response")
            if response is None or target_url not in response["url"]:
                continue
            # fetch the response body via CDP and build one link per store id
            body = driver.execute_cdp_cmd('Network.getResponseBody', {'requestId': params["requestId"]})
            items = json.loads(body['body'])['items']
            for item in items:
                link = f"{url}/details/{item['id']}"
                print(link)
    
    get_links()
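
    A caveat with the snippet above: each call to get_log("performance") returns only the entries gathered since the previous call, and the business/search response may not have arrived yet when the log is read once. Below is a minimal sketch that polls the log until the response appears and then writes the collected links to a CSV file (the question already imports csv, presumably for this). The collect_store_links helper, the 30-second timeout, the 1-second poll interval, and the store_links.csv filename are my own assumptions, not part of the original answer.

    import csv
    import json
    import time

    def collect_store_links(driver, base_url, target_fragment='business/search', timeout=30):
        """Poll the performance log until the store-search response is seen.

        driver.get_log("performance") only returns entries gathered since the
        previous call, so we keep polling and scanning until the target
        response shows up or the timeout expires.
        """
        links = []
        deadline = time.time() + timeout
        while time.time() < deadline and not links:
            for entry in driver.get_log("performance"):
                event = json.loads(entry["message"])["message"]
                if event.get("method") != "Network.responseReceived":
                    continue
                params = event.get("params", {})
                if target_fragment not in params.get("response", {}).get("url", ""):
                    continue
                # pull the JSON body of the matching response via CDP
                body = driver.execute_cdp_cmd(
                    'Network.getResponseBody', {'requestId': params["requestId"]})
                for item in json.loads(body['body'])['items']:
                    links.append(f"{base_url}/details/{item['id']}")
            time.sleep(1)  # wait a moment before polling the log again
        return links

    # usage: assumes `driver` was started with the logging prefs shown above
    # and has already navigated to https://maroof.sa/businesses
    links = collect_store_links(driver, "https://maroof.sa/businesses")
    with open("store_links.csv", "w", newline="") as f:
        csv.writer(f).writerows([link] for link in links)

    Note that Network.getResponseBody can fail if Chrome has already discarded the body, so in practice you may want to wrap that call in a try/except and keep polling.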