I am working on a Web scraping project. The URL for the website I am scraping is https://www.beliani.de/sofas/ledersofa/
I am scraping all the product links listed on this page. I tried getting the links using both Requests-HTML and Selenium, but I get 57 and 24 links respectively, while there are more than 150 products listed on the page. Below are the code blocks I am using.
Using Selenium:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from time import sleep
options = Options()
options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36")
# path to the chrome driver
DRIVER_PATH = 'C:/chromedriver'
driver = webdriver.Chrome(executable_path=DRIVER_PATH, options=options)
url = 'https://www.beliani.de/sofas/ledersofa/'
driver.get(url)
sleep(20)
links = []
for a in driver.find_elements_by_xpath('//*[@id="offers_div"]/div/div/a'):
    print(a)
    links.append(a)
print(len(links))
Using Requests-HTML:
from requests_html import HTMLSession
url = 'https://www.beliani.de/sofas/ledersofa/'
s = HTMLSession()
r = s.get(url)
r.html.render(sleep = 20)
products = r.html.xpath('//*[@id="offers_div"]', first = True)
#Getting 57 links using below block:
links = []
for link in products.absolute_links:
    print(link)
    links.append(link)
print(len(links))
I cannot figure out which step I am doing wrong or what is missing.
You have to scroll through the page and reach the bottom in order to load all of its content. When you first open the page, only the content needed to render the visible section is loaded, so when you ran your code it could only retrieve the products that had been loaded so far.
This one gave me 160 links:
# (driver is set up exactly as in the question above)
driver.get('https://www.beliani.de/sofas/ledersofa/')
sleep(3)

# get the full height of the document
height = driver.execute_script('return document.body.scrollHeight')

# break the page into parts and scroll through each section so its content loads
scroll_height = 0
for i in range(10):
    scroll_height = scroll_height + (height / 10)
    driver.execute_script('window.scrollTo(0, arguments[0]);', scroll_height)
    sleep(2)

# I have used the 'class' locator; you can use any locator you want once the loop has finished
a_tags = driver.find_elements_by_class_name('itemBox')
count = 0
for i in a_tags:
    if i.get_attribute('href') is not None:
        print(i.get_attribute('href'))
        count += 1
print(count)
driver.quit()
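The chunked-scroll arithmetic in the loop above can be factored into a small pure helper that computes the scroll targets up front. This is just a sketch; `scroll_offsets` is a name invented here, not part of Selenium:

```python
def scroll_offsets(total_height, steps=10):
    """Evenly spaced scroll targets covering the whole page.

    Mirrors the loop in the answer: each target is one more chunk of
    total_height / steps, and the last target is exactly the page bottom.
    """
    return [total_height * (i + 1) / steps for i in range(steps)]

# For a 5000px-tall document split into 10 chunks:
print(scroll_offsets(5000))  # [500.0, 1000.0, ..., 5000.0]
```

With a live driver you would feed each offset to `driver.execute_script('window.scrollTo(0, arguments[0]);', offset)` followed by a short `sleep`, so the newly revealed products have time to load before the next scroll.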