python, html, web-scraping, beautifulsoup, html-parsing

Beautifulsoup unable to find more than 24 classes with find_all


I'm trying to scrape data from a page where all the items are stored like this:

<div class="box browsingitem canBuy 123"> </div>
<div class="box browsingitem canBuy 264"> </div>

There are hundreds of these, but when I try to add them to an array, it only saves 24:

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import re
import lxml

my_url = 'https://www.alza.co.uk/tablets/18852388.htm'


uClient = uReq(my_url)

page_html = uClient.read()

uClient.close()

page_soup = soup(page_html, "lxml")

classname = "box browsingitem"
containers = page_soup.find_all("div", {"class":re.compile(classname)})

#len(containers) will be equal to 24

for container in containers:    
    title_container = container.find_all("a",{"class":"name browsinglink"})
    product_name = title_container[0].text  
    print("product_name: " + product_name)

Is it a problem with re.compile? How else could I search for the classes?

Thanks for the help


Solution

  • So in this case, only 24 items are loaded into the DOM when you first visit the page. Two options come to mind: 1) use a headless browser to click the "load more" button so the remaining items are loaded into the DOM (a rough sketch of this approach is at the end of this answer), or 2) build a simple pagination scheme and loop through the pages.

    Here is an example of the second option:

    import requests
    from bs4 import BeautifulSoup as soup

    for page in range(0, 10):
        print("Trying page # {}".format(page))
        if page == 0:
            my_url = 'https://www.alza.co.uk/tablets/18852388.html'
        else:
            my_url = 'https://www.alza.co.uk/tablets/18852388-p{}.html'.format(page)

        page_html = requests.get(my_url)
        page_soup = soup(page_html.content, "lxml")
        items = page_soup.find_all('div', {"class": "browsingitem"})
        print("Found a total of {}".format(len(items)))
        for item in items:
            # look inside each item rather than the whole page
            title = item.find('a', 'browsinglink')
    

    You can see that the pagination is built into the URLs, so all you need to do is decide how many pages you want to scrape (or keep requesting pages until an empty one comes back, as sketched right after the output) and collect the data from each page. Here is the output:

    Trying page # 0
    Found a total of 24
    Trying page # 1
    Found a total of 24
    Trying page # 2
    Found a total of 24
    Trying page # 3
    Found a total of 24
    Trying page # 4
    Found a total of 24
    Trying page # 5
    Found a total of 24
    Trying page # 6
    Found a total of 24
    Trying page # 7
    Found a total of 24
    Trying page # 8
    Found a total of 17
    Trying page # 9
    Found a total of 0
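
    Here is the small variation mentioned above: instead of hard-coding ten pages, it keeps requesting pages until one comes back empty (you can see in the output that page 9 has 0 items) and collects every product name along the way. It assumes the same URL pattern as the loop above.

    import requests
    from bs4 import BeautifulSoup as soup

    base_url = 'https://www.alza.co.uk/tablets/18852388'
    product_names = []
    page = 0
    while True:
        if page == 0:
            my_url = '{}.html'.format(base_url)
        else:
            my_url = '{}-p{}.html'.format(base_url, page)
        page_soup = soup(requests.get(my_url).content, "lxml")
        items = page_soup.find_all('div', {"class": "browsingitem"})
        if not items:
            break  # an empty page means there are no more results
        for item in items:
            link = item.find('a', 'browsinglink')
            if link:
                product_names.append(link.text.strip())
        page += 1

    print("Collected {} product names".format(len(product_names)))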
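
    And here is a rough sketch of the first option: a headless browser (Selenium with headless Chrome) that clicks the "load more" button until everything is in the DOM. Note that ".js-button-more" is only a placeholder selector; you would need to inspect the page and use whatever selector the real button has.

    import time
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from bs4 import BeautifulSoup as soup

    options = Options()
    options.add_argument("--headless")        # run Chrome without opening a window
    driver = webdriver.Chrome(options=options)
    driver.get('https://www.alza.co.uk/tablets/18852388.html')

    while True:
        try:
            # keep clicking the "load more" control until it is gone
            driver.find_element(By.CSS_SELECTOR, ".js-button-more").click()
            time.sleep(2)                     # give the new items time to render
        except Exception:
            break                             # no button left, everything is loaded

    page_soup = soup(driver.page_source, "lxml")
    driver.quit()

    items = page_soup.find_all('div', {"class": "browsingitem"})
    print("Found a total of {}".format(len(items)))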