python beautifulsoup html-parsing

Cannot parse an address that contains ".html#/something" using bs4 in Python 3


My goal is to parse images from the second page. I am using bs4 and Python 3 for this. Please look at these two pages:

1) The page with images for all 4 colors (I can parse this page);

2) The page that contains images for only 1 color (chrome in this example). I need to parse this page.

In a browser I can see that the second page is different from the first one. But with bs4 I get the same results for both pages, as Python doesn't seem to recognize the ".html#/kolor-chrom" part of the second page's address.

First page address: "https://azzardo.com.pl/lampy-techniczne/2111-bross-1-tuba-lampa-techniczna-azzardo.html".

Second page address: "https://azzardo.com.pl/lampy-techniczne/2111-bross-1-tuba-lampa-techniczna-azzardo.html#/kolor-chrom".

Code to reproduce:

from bs4 import BeautifulSoup
import requests

adres1 = "https://azzardo.com.pl/lampy-techniczne/2111-bross-1-tuba-lampa-techniczna-azzardo.html"
adres2 = "https://azzardo.com.pl/lampy-techniczne/2111-bross-1-tuba-lampa-techniczna-azzardo.html#/kolor-chrom"

def parse_one_page(adres):
    """Parse one page and get all the img src from adres"""
    # Send a browser-like User-Agent so the request is not flagged as a script
    headers = {'User-Agent': 'Mozilla/5.0'}
    # Get page
    page = requests.get(adres, headers=headers)  # read_timeout=5
    # Get all of the html code
    soup = BeautifulSoup(page.content, 'html.parser')
    # Find div
    divclear = soup.find_all("div", class_="clearfix")
    divclear = divclear[9]
    # Find img tag
    imgtag = [i.find_all("img") for i in divclear][0]
    # Find src
    src = [i["src"] for i in imgtag]
    # See how many images there are
    print(len(src))
    # return list with img src
    return src


print(parse_one_page(adres1))
print(parse_one_page(adres2))

After running this code you will see that the output for the two addresses is the same: 24 images from each. The first page does have 24 images (that's correct). But the second page should have only 2 images, not 24 (incorrect)!

I hope someone can help me parse the second page correctly in Python 3 using bs4.


Solution

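  • Yep, it looks like it's not possible to parse a dynamic page like this using bs4 alone. The part of the address after "#" is a URL fragment, which is never sent to the server, so both requests return the same HTML. The color filtering you see in the browser presumably happens client-side in JavaScript, which bs4 does not run.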
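
A quick way to see why both addresses give the same result: everything after "#" is a URL fragment, which the client keeps to itself and does not include in the HTTP request. The sketch below uses only the standard library's urllib.parse.urldefrag and makes no network call. The helper name split_fragment is made up for this illustration; the two addresses are the ones from the question.

```python
from urllib.parse import urldefrag

adres1 = "https://azzardo.com.pl/lampy-techniczne/2111-bross-1-tuba-lampa-techniczna-azzardo.html"
adres2 = adres1 + "#/kolor-chrom"


def split_fragment(adres):
    """Return (address without fragment, fragment) for one address.

    Hypothetical helper for illustration only.
    """
    result = urldefrag(adres)
    return result.url, result.fragment


base1, frag1 = split_fragment(adres1)
base2, frag2 = split_fragment(adres2)

# Once the fragment is removed, both addresses point at the same resource
print(base1 == base2)
# Only the second address carries a fragment
print(repr(frag1), repr(frag2))
```

Since the server sees the same request either way, getting only the chrome images would likely need something that executes the page's JavaScript (a browser automation tool, for example), or filtering the 24 images yourself after parsing.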