Tags: python, beautifulsoup, combobox, dropdown, screen-scraping

Scraping dropdown menu values with Python BeautifulSoup


I checked most of the existing posts but didn't find an answer to my small question.

This is the dropdown I want to scrape:

<div class="input-box">
    <select name="super_attribute[138]" id="attribute138" class="required-entry super-attribute select form-control" onchange="notifyMe(this.value, this.options[this.selectedIndex].innerHTML);">
        <option value="">Choose an Option...</option>
        <option value="17" price="0">M (in stock) </option>
        <option value="18" price="0">L (out of stock) </option>
        <option value="15" price="0">XL (in stock) </option>
        <option value="52" price="0">XXL (in stock) </option>
    </select>
</div>

My Python code is:

items = soup.select('option[value]')
values = [item.get('value') for item in items]
textvalues = [item.text for item in items]

print(textvalues)

And the output is: ['select', '(In-Stock)', '(Out-Stock)', '(In-Stock)', '(In-Stock)']

I also need the other values (SizeValue and SizeName): 17 & M / 18 & L / 15 & XL / 52 & XXL

If I remove the .text, I get this output:

   <option value="">select</option>, <option value="200@#-(In-Stock)@#-https://store.alsabihmarine.com/index.php/diving-equipments/wetsuits/camouflage-hooded-suits-220.html@#-">(In-Stock)</option>, <option value="201@#-(Out-Stock)@#-https://store.alsabihmarine.com/index.php/diving-equipments/wetsuits/camouflage-hooded-suits-220.html@#-">(Out-Stock)</option>, <option value="202@#-(In-Stock)@#-https://store.alsabihmarine.com/index.php/diving-equipments/wetsuits/camouflage-hooded-suits-220.html@#-">(In-Stock)</option>, <option value="203@#-(In-Stock)@#-https://store.alsabihmarine.com/index.php/diving-equipments/wetsuits/camouflage-hooded-suits-220.html@#-">(In-Stock)</option>

Thanks in advance for your help.


Solution

  • It's quite simple: concatenate with + and also call item.get_text() in your list comprehension. Slicing with items[1:] skips the first (placeholder) option.

    Instead of:

    values = [item.get('value') for item in items]
    

    use:

    values = [item.get('value') + item.get_text(strip=True) for item in items[1:]]
    print(values)
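
If you need the value, size name and stock status as separate fields rather than one concatenated string, the same idea can be extended. The following is a minimal sketch run against the sample markup from the question; the label pattern (LABEL_RE) and the tuple layout are assumptions made for illustration, not something the site guarantees.

```python
import re
from bs4 import BeautifulSoup

# Sample markup copied from the question (attributes trimmed for brevity).
html = """
<div class="input-box">
    <select name="super_attribute[138]" id="attribute138">
        <option value="">Choose an Option...</option>
        <option value="17" price="0">M (in stock) </option>
        <option value="18" price="0">L (out of stock) </option>
        <option value="15" price="0">XL (in stock) </option>
        <option value="52" price="0">XXL (in stock) </option>
    </select>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Assumed label shape: a size name followed by the status in parentheses.
LABEL_RE = re.compile(r"^(?P<size>\S+)\s*\((?P<status>[^)]*)\)$")

sizes = []
for option in soup.select("select#attribute138 option"):
    value = option.get("value")
    if not value:  # skip the "Choose an Option..." placeholder
        continue
    label = option.get_text(strip=True)
    match = LABEL_RE.match(label)
    # Fall back to the raw label if it does not follow the assumed shape.
    size = match.group("size") if match else label
    status = match.group("status") if match else None
    sizes.append((value, size, status))

print(sizes)
```

Filtering on an empty value instead of slicing with [1:] keeps working even if the placeholder option is not the first one.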
    

    EDIT: The options are loaded dynamically, and requests does not run JavaScript, so they are missing from the HTML you parse. However, the data is embedded in the page in JSON format. You can extract it with a regular expression using the re module:

    import json
    import re
    import requests
    
    
    url = "https://store.alsabihmarine.com/index.php/diving-equipments/wetsuits/camouflage-hooded-suits-220.html"
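    # Use .text (decoded str); str() on the raw bytes would add b'...' and escape sequences
    response = requests.get(url).text
    
    # Capture the JSON object passed to Product.Config(...)
    regex_pattern = re.compile(r"Product\.Config\(({.*?})\);")
    data = json.loads(regex_pattern.search(response).group(1))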
    
    print(
        [
            product["id"] + product["label"]
            for product in data["attributes"]["138"]["options"]
        ]
    )
    

    Output:

    ['17M (in stock) ', '18L (out of stock) ', '15XL (in stock) ', '52XXL (in stock) ']
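
The regex approach can be tried offline before pointing it at the live page. The sketch below uses only the standard library; sample_html is a hypothetical fragment whose shape is inferred from the keys accessed above ("attributes", "138", "options", "id", "label"), and extract_options is an illustrative helper name, so the real page may differ.

```python
import json
import re

# Hypothetical page fragment; the real page embeds a much larger config object.
sample_html = """
<script type="text/javascript">
    var spConfig = new Product.Config({"attributes": {"138": {"options": [
        {"id": "17", "label": "M (in stock) "},
        {"id": "18", "label": "L (out of stock) "}
    ]}}});
</script>
"""

# DOTALL lets the lazy match span line breaks inside the embedded JSON.
CONFIG_RE = re.compile(r"Product\.Config\((\{.*?\})\);", re.DOTALL)


def extract_options(html, attribute_id):
    """Return (id, label) pairs for one attribute, or [] if nothing is found."""
    match = CONFIG_RE.search(html)
    if match is None:
        return []
    config = json.loads(match.group(1))
    options = config.get("attributes", {}).get(attribute_id, {}).get("options", [])
    return [(opt["id"], opt["label"].strip()) for opt in options]


print(extract_options(sample_html, "138"))
```

Returning an empty list when the pattern is missing avoids the AttributeError that calling .group(1) on a failed search would raise, which is useful if the site changes its markup.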