I'm trying to scrape data from a page where all the items are stored like this:
<div class="box browsingitem canBuy 123"> </div>
<div class="box browsingitem canBuy 264"> </div>
There are hundreds of these, but when I try to collect them into a list, only 24 are saved:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import re
import lxml
my_url = 'https://www.alza.co.uk/tablets/18852388.htm'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "lxml")
classname = "box browsingitem"
containers = page_soup.find_all("div", {"class":re.compile(classname)})
# len(containers) will be equal to 24
for container in containers:
    title_container = container.find_all("a", {"class": "name browsinglink"})
    product_name = title_container[0].text
    print("product_name: " + product_name)
Is it a problem with re.compile? How else could I search for the classes?
Thanks for the help
In this case, only 24 items are in the DOM when you first visit the page, so re.compile is not the problem. Two options occur to me: 1) use a headless browser to click the "load more" button so that more items are loaded into the DOM, or 2) work out the site's pagination scheme and loop through the pages.
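To check that the class lookup itself is not what limits the result, here is a minimal sketch on an inline HTML fragment. The fragment is made up for illustration and only modeled on the markup in the question. BeautifulSoup treats `class` as a multi-valued attribute, so a single class name matches any element that carries it and no regular expression is needed. The sketch uses the built-in `html.parser` so that it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modeled on the markup in the question.
html = """
<div class="box browsingitem canBuy 123"><a class="name browsinglink" href="#">Tablet A</a></div>
<div class="box browsingitem canBuy 264"><a class="name browsinglink" href="#">Tablet B</a></div>
<div class="box other"><a class="name" href="#">Not a product</a></div>
"""

page_soup = BeautifulSoup(html, "html.parser")

# class_ is compared against each individual class of an element,
# so "browsingitem" matches both product divs but not the third one.
containers = page_soup.find_all("div", class_="browsingitem")

# Look up the link inside each container rather than in the whole page.
names = [c.find("a", class_="browsinglink").get_text(strip=True) for c in containers]
print(len(containers), names)
```

Every matching element in the parsed HTML is found, which suggests the limit of 24 comes from what the server sends, not from the search.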
Here is an example of the second option:
import requests
from bs4 import BeautifulSoup as soup

titles = []
for page in range(0, 10):
    print("Trying page # {}".format(page))
    # The first page has no "-p" suffix; later pages do.
    if page == 0:
        my_url = 'https://www.alza.co.uk/tablets/18852388.html'
    else:
        my_url = 'https://www.alza.co.uk/tablets/18852388-p{}.html'.format(page)

    page_html = requests.get(my_url)
    page_soup = soup(page_html.content, "lxml")

    items = page_soup.find_all('div', {"class": "browsingitem"})
    print("Found a total of {}".format(len(items)))

    for item in items:
        # Search within the current item, not the whole page.
        title = item.find('a', 'browsinglink')
        if title is not None:
            titles.append(title.text.strip())
You can see that the URLs have the pagination information built in, so all you need to do is determine how many pages you want to scrape, and you can save all that information. Here is the output:
Trying page # 0
Found a total of 24
Trying page # 1
Found a total of 24
Trying page # 2
Found a total of 24
Trying page # 3
Found a total of 24
Trying page # 4
Found a total of 24
Trying page # 5
Found a total of 24
Trying page # 6
Found a total of 24
Trying page # 7
Found a total of 24
Trying page # 8
Found a total of 17
Trying page # 9
Found a total of 0
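Since the last page in that output returns 0 items, one possible refinement is to stop at the first empty page instead of hard-coding the page count. The following is only a sketch under assumptions: the helper names (`page_url`, `product_names`, `fetch`, `scrape_all`) and the `max_pages` safety cap are hypothetical, the URL pattern is the one used above, and fetching is done with `urlopen` as in the question so that bs4 is the only third-party dependency.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup

# Base of the category URL, taken from the URLs used above.
BASE = "https://www.alza.co.uk/tablets/18852388"


def page_url(page):
    """Build the URL for a zero-based page index, following the pattern above."""
    if page == 0:
        return BASE + ".html"
    return "{}-p{}.html".format(BASE, page)


def product_names(html):
    """Return the product names found in one page of HTML."""
    page_soup = BeautifulSoup(html, "html.parser")
    names = []
    for item in page_soup.find_all("div", class_="browsingitem"):
        link = item.find("a", class_="browsinglink")
        if link is not None:
            names.append(link.get_text(strip=True))
    return names


def fetch(url):
    """Download one page and return its raw HTML."""
    with urlopen(url, timeout=10) as response:
        return response.read()


def scrape_all(fetch_html=fetch, max_pages=50):
    """Collect names page by page until a page yields none.

    max_pages is an assumed safety cap so the loop always terminates.
    """
    all_names = []
    for page in range(max_pages):
        names = product_names(fetch_html(page_url(page)))
        if not names:
            break  # an empty page is treated as the end of the listing
        all_names.extend(names)
    return all_names
```

Calling `scrape_all()` would then walk the pages until one comes back empty. Passing a different `fetch_html` function makes it easy to try the loop on saved HTML without hitting the site.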