python, web-scraping, beautifulsoup, google-colaboratory

Deep web data scraping with Python in Google Colaboratory


I have the code below, from an answer to another question of mine.

It can pull the data on each listing page. My next problem is how to pull the details for each dress, such as the model's name, the model's size, and the features.

On top of that, there is more than one model for each dress (for example, the BOHO BIRD Amore Wrap Dress has three models, wearing sizes 10, 14, and 16).

    import json
        
    import requests
    from bs4 import BeautifulSoup

    cookies = {
        "ServerID": "1033",
        "__zlcmid": "10tjXhWpDJVkUQL",
    }
    
    headers = {
        "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36"
    }

    def extract_info(bs: BeautifulSoup, tag: str, attr_value: str) -> list:
        # Collect the stripped text of every <tag> element whose itemprop matches attr_value.
        return [i.text.strip() for i in bs.find_all(tag, {"itemprop": attr_value})]

    all_pages = []
    for page in range(1, 29):
        # Progress indicator; the total is reported after the loop.
        print(f"Scraping page {page}...")
    
        current_page = f"https://www.birdsnest.com.au/womens/dresses?page={page}"
        source = requests.get(current_page, headers=headers, cookies=cookies)
        soup = BeautifulSoup(source.content, 'html.parser')
    
        brand = extract_info(soup, tag="strong", attr_value="brand")
        name = extract_info(soup, tag="h2", attr_value="name")
        price = extract_info(soup, tag="span", attr_value="price")
    
        all_pages.extend(
            [
                {
                    "brand": b,
                    "name": n,
                    "price": p,
                } for b, n, p in zip(brand, name, price)
            ]
        )
    
    print(f"Found: {len(all_pages)} dresses.")
    with open("all_the_dresses2.json", "w") as jf:
        json.dump(all_pages, jf, indent=4)
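
My rough idea for also grabbing each product's link (so that I could then open the detail pages one by one) is sketched below. It assumes that each product card on the listing page wraps its link in an <a> tag with an itemprop="url" attribute, but I have not confirmed that selector against the real markup:

    def extract_links(bs: BeautifulSoup, base: str = "https://www.birdsnest.com.au") -> list:
        # Assumed selector: <a itemprop="url" href="...">; needs checking in the page source.
        return [
            base + a["href"] if a["href"].startswith("/") else a["href"]
            for a in bs.find_all("a", {"itemprop": "url"})
            if a.get("href")
        ]

    # inside the page loop, next to brand/name/price:
    # links = extract_links(soup)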

Solution

  • The information that you want is generated dynamically, so you won't get it with requests. I suggest you use selenium for that.

    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    import time
    
    
    link = 'https://www.birdsnest.com.au/brands/boho-bird/73067-amore-wrap-dress'
    options = Options()
    options.add_argument('--headless')      # run Chrome without opening a window
    options.add_argument('--disable-gpu')
    # Point this at your local chromedriver executable.
    driver = webdriver.Chrome('C:/Users/../Downloads/../chromedriver.exe', options=options)
    driver.get(link)
    time.sleep(3)  # give the page's JavaScript a few seconds to render the model info

    soup = BeautifulSoup(driver.page_source, 'html.parser')
    driver.quit()  # end the browser session now that we have the rendered HTML
    page_new = soup.find('div', class_='model-info clearfix')
    results = page_new.find_all('p')
    for result in results:
        print(result.text)
    

    Output

    Marnee usually wears a size 8.
                    She is wearing a size 10 in this style.
                  
    Her height is 178 cm.
    
    Show Marnee’s body measurements
    
    Marnee’s body measurements are:
    Bust 81 cm
    Waist 64 cm
    Hips 89 cm
    

    <div class="model-info-header">
                  <p>
                    <strong><span class="model-info__name">Marnee</span></strong> usually wears a size <strong><span class="model-info__standard-size">8</span></strong>.
                    She is wearing a size <strong><span class="model-info__wears-size">10</span></strong> in this style.
                  </p>
                  <p class="model-info-header__height">Her height is <strong><span class="model-info__height">178 cm</span></strong>.</p>
                  <p>
                    <span class="js-model-info-more model-info__link model-info-header__more">Show <span class="model-info__name">Marnee</span>’s body measurements</span>
                  </p>
                </div>

    With requests alone you will miss all of the data in bold (the values inside the <strong>/<span> tags), which is exactly what you want.
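
    Once the page has been rendered by selenium, the class names in that snippet can also be used to pull the bold values out as structured fields, which covers dresses that show more than one model. Here is a minimal sketch, assuming each model gets its own div carrying the model-info class (if the real page groups several models inside a single block, the outer selector will need adjusting):

    def extract_models(soup: BeautifulSoup) -> list:
        models = []
        # Assumption: one "model-info" div per model; verify against the real page.
        for block in soup.find_all('div', class_='model-info'):
            def grab(css_class):
                el = block.find('span', class_=css_class)
                return el.text.strip() if el else None
            models.append({
                "name": grab('model-info__name'),
                "usual_size": grab('model-info__standard-size'),
                "wears_size": grab('model-info__wears-size'),
                "height": grab('model-info__height'),
            })
        return models

    # e.g. with the soup built from driver.page_source above:
    # print(extract_models(soup))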