Web Scrape - How do I click options based on select name using contains value?


I'm trying to scrape the select dropdown menu below in order to get its text content.

I cannot match on the full name because the "_P889O1" prefix changes for each product I want to extract data from. I thought I could use 'contains' instead, but I'm getting an error saying that the script is "not a valid XPath expression".

The total price changes depending on the selected option, so I believe a click is required here?

HTML being used:

<div class="GC75 ProductChoiceName" id="ProductChoiceName-%%" sf:object="ProductChoiceName" style="color: rgb(36, 36, 36);">
<select name="_P889O1Barrel length" onfocus="if(tf.core.idTextOptionBlur){clearTimeout(tf.core.idTextOptionBlur);tf.core.idTextOptionBlur=null;}if(tf.core.onblurcode){eval(tf.core.onblurcode);tf.core.onblurcode='';tf.core.setFocusID='';}" onclick="cancelBuble(event);if(tf.isInSF())return false;" onchange="tf.core.crFFldImager.replace('P889',this.value.split(core.str_sep1)[7]);var c = this.value;tf.core.crFFldOptPrc.updPrc('P889',this.value?this.value.split(core.str_sep1)[7]:'P889O1',crFFldArr,opt);dBasePrice2('P889',c);return false;" size="1">
<option value="">Barrel length&nbsp;*</option><option value="28&quot;~|`0~|`0.00~|`50220~|`0.000000~|`0.000~|`~|`P889O1C1" origvalue="28&quot;~|`0~|`0.00~|`50220~|`0.000000~|`0.000~|`~|`P889O1C1">28"</option>
<option value="30&quot;~|`0~|`0.00~|`50222~|`0.000000~|`0.000~|`~|`P889O1C2" origvalue="30&quot;~|`0~|`0.00~|`50222~|`0.000000~|`0.000~|`~|`P889O1C2">30"</option>
<option value="32&quot;~|`0~|`0.00~|`50224~|`0.000000~|`0.000~|`~|`P889O1C3" origvalue="32&quot;~|`0~|`0.00~|`50224~|`0.000000~|`0.000~|`~|`P889O1C3">32"</option>
</select></div>

Full code snippet:

from bs4 import BeautifulSoup
import requests
import shutil
import csv
import pandas
from pandas import DataFrame
import re
import os
import urllib.request as urllib2
import locale
import json
from selenium import webdriver
import lxml.html
import time
from selenium.webdriver.support.ui import Select 
os.environ["PYTHONIOENCODING"] = "utf-8"

#selenium requests
browser = webdriver.Chrome(executable_path='C:/Users/admin/chromedriver.exe')
browser.get("https://www.mcavoyguns.co.uk/contents/en-uk/d130_Beretta_Over___Under_Competeition_shotguns.html")
time.sleep(2)

#beautiful soup requests
URL = 'https://www.mcavoyguns.co.uk/contents/en-uk/d130_Beretta_Over___Under_Competeition_shotguns.html'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
products = soup.find_all("div", "GC62 Product")


barrels = soup.find_all('select', attrs={'name': re.compile('length')})
[[x['origvalue'][:2] for x in i.find_all('option')[1:]] for i in barrels]


for product in products:
    #title
    title = product.find("h3")
    titleText = title.text if title else ''

    #manufacturer name
    manufacturer = product.find("div", "GC5 ProductManufacturer")
    manuText = manufacturer.text if manufacturer else ''

    #image location
    img = product.find("div", "ProductImage")
    imglinks = img.find("a") if img else ''
    imglinkhref = imglinks.get('href') if imglinks else ''
    imgurl = 'https://www.mcavoyguns.co.uk/contents'+imglinkhref

    #description
    description = product.find("div", "GC12 ProductDescription")
    descText = description.text if description else ''

    #more description
    more = product.find("div", "GC12 ProductDetailedDescription")
    moreText = more.text if more else ''

    #price
    spans = browser.find_elements_by_css_selector("div.GC20.ProductPrice span")
    for i in range(0,len(spans),2):
        span = spans[i].text

        #print(span)
        #print(titleText)
        #print(manuText)
        #print(descText)
        #print(moreText)
        #print(imgurl.replace('..', ''))
        #print("\n")

Both times I've included print(x) just as a visual aid to show myself that it's "working".


Solution

  • You'll need Selenium, as the dropdown menus are generated through JavaScript. Two suggestions. First, Selenium takes some time to load the page dynamically, so add a time.sleep to allow for this. Second, the XPath syntax needs a small change: contains() takes the attribute (@name) and the substring to look for:

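    import time
    from selenium.webdriver.common.by import By

    browser.get("https://www.mcavoyguns.co.uk/contents/en-uk/d130_Beretta_Over___Under_Competeition_shotguns.html")
    time.sleep(2)  # give the JavaScript time to build the dropdowns
    # find_element(By.XPATH, ...) replaces find_element_by_xpath, which was removed in Selenium 4.3
    dropd = browser.find_element(By.XPATH, "//select[contains(@name, 'Barrel')]")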
    

    Output of print(dropd.text):

    Barrel length *
    28"
    30"
    32"
    

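    Alternatively, you can use BeautifulSoup in combination with Selenium:

    import time
    import re
    from bs4 import BeautifulSoup

    browser.get("https://www.mcavoyguns.co.uk/contents/en-uk/d130_Beretta_Over___Under_Competeition_shotguns.html")
    time.sleep(2)
    # Parse the HTML as rendered by the browser, not the raw server response
    soup = BeautifulSoup(browser.page_source, 'html.parser')
    barrels = soup.find_all('select', attrs={'name': re.compile('length')})
    # First two characters of each option's origvalue, skipping the placeholder option
    lengths = [[x['origvalue'][:2] for x in i.find_all('option')[1:]] for i in barrels]
    print(lengths)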
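
Note that the [:2] slice only works while every label is exactly two characters long; a hypothetical 26.5" label would be cut short. The option value appears to be a list of fields joined by ~|`, with the label first and the choice code last (the page's own onchange handler reads field 7 after splitting). A small sketch that splits on that separator instead; parse_option_value, BarrelOption and SEPARATOR are illustrative names, and the assumption that every product uses the same separator and field order is untested:

```python
from typing import NamedTuple

# Separator seen between the fields in the question's HTML.
# Assumption: it is the same for every product on the site.
SEPARATOR = "~|`"


class BarrelOption(NamedTuple):
    label: str        # text shown to the user, e.g. 28"
    choice_code: str  # last field, e.g. P889O1C1


def parse_option_value(raw: str) -> BarrelOption:
    """Split one value/origvalue attribute into its label and choice code."""
    fields = raw.split(SEPARATOR)
    return BarrelOption(label=fields[0], choice_code=fields[-1])


# BeautifulSoup already decodes &quot; to a plain double quote.
sample = '28"~|`0~|`0.00~|`50220~|`0.000000~|`0.000~|`~|`P889O1C1'
print(parse_option_value(sample))
# BarrelOption(label='28"', choice_code='P889O1C1')
```

In the list comprehension, parse_option_value(x['origvalue']).label would then take the place of x['origvalue'][:2].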
    

    To make it fit your full code:

    browser.get("https://www.mcavoyguns.co.uk/contents/en-uk/d130_Beretta_Over___Under_Competeition_shotguns.html")
    time.sleep(2)

    soup = BeautifulSoup(browser.page_source, 'html.parser')
    products = soup.find_all("div", "GC62 Product")

    for product in products:
        # barrel lengths offered for this product, if it has a barrel dropdown
        barrels = product.find('select', attrs={'name': re.compile('length')})
        if barrels:
            barrels_list = [x['origvalue'][:2] for x in barrels.find_all('option')[1:]]
            print(barrels_list)