I'm trying to scrape the select dropdown menu below to get its text content.
I cannot match on the full name, because the "_P889O1" prefix changes for each product I will be extracting data from. I thought I could use 'contains' instead, but I'm getting an error saying the script is "not a valid XPath expression".
The total price changes depending on the option selected, so I believe a click is required here?
HTML being used:
<div class="GC75 ProductChoiceName" id="ProductChoiceName-%%" sf:object="ProductChoiceName" style="color: rgb(36, 36, 36);">
<select name="_P889O1Barrel length" onfocus="if(tf.core.idTextOptionBlur){clearTimeout(tf.core.idTextOptionBlur);tf.core.idTextOptionBlur=null;}if(tf.core.onblurcode){eval(tf.core.onblurcode);tf.core.onblurcode='';tf.core.setFocusID='';}" onclick="cancelBuble(event);if(tf.isInSF())return false;" onchange="tf.core.crFFldImager.replace('P889',this.value.split(core.str_sep1)[7]);var c = this.value;tf.core.crFFldOptPrc.updPrc('P889',this.value?this.value.split(core.str_sep1)[7]:'P889O1',crFFldArr,opt);dBasePrice2('P889',c);return false;" size="1">
<option value="">Barrel length *</option><option value="28"~|`0~|`0.00~|`50220~|`0.000000~|`0.000~|`~|`P889O1C1" origvalue="28"~|`0~|`0.00~|`50220~|`0.000000~|`0.000~|`~|`P889O1C1">28"</option>
<option value="30"~|`0~|`0.00~|`50222~|`0.000000~|`0.000~|`~|`P889O1C2" origvalue="30"~|`0~|`0.00~|`50222~|`0.000000~|`0.000~|`~|`P889O1C2">30"</option>
<option value="32"~|`0~|`0.00~|`50224~|`0.000000~|`0.000~|`~|`P889O1C3" origvalue="32"~|`0~|`0.00~|`50224~|`0.000000~|`0.000~|`~|`P889O1C3">32"</option>
</select></div>
Full code snippet:
from bs4 import BeautifulSoup
import requests
import shutil
import csv
import pandas
from pandas import DataFrame
import re
import os
import urllib.request as urllib2
import locale
import json
from selenium import webdriver
import lxml.html
import time
from selenium.webdriver.support.ui import Select
os.environ["PYTHONIOENCODING"] = "utf-8"
#selenium requests
browser = webdriver.Chrome(executable_path='C:/Users/admin/chromedriver.exe')
browser.get("https://www.mcavoyguns.co.uk/contents/en-uk/d130_Beretta_Over___Under_Competeition_shotguns.html")
time.sleep(2)
#beautiful soup requests
URL = 'https://www.mcavoyguns.co.uk/contents/en-uk/d130_Beretta_Over___Under_Competeition_shotguns.html'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
products = soup.find_all("div", "GC62 Product")
barrels = soup.find_all('select', attrs={'name': re.compile('length')})
[[x['origvalue'][:2] for x in i.find_all('option')[1:]] for i in barrels]
for product in products:
    #title
    title = product.find("h3")
    titleText = title.text if title else ''
    #manufacturer name
    manufacturer = product.find("div", "GC5 ProductManufacturer")
    manuText = manufacturer.text if manufacturer else ''
    #image location
    img = product.find("div", "ProductImage")
    imglinks = img.find("a") if img else ''
    imglinkhref = imglinks.get('href') if imglinks else ''
    imgurl = 'https://www.mcavoyguns.co.uk/contents'+imglinkhref
    #description
    description = product.find("div", "GC12 ProductDescription")
    descText = description.text if description else ''
    #more description
    more = product.find("div", "GC12 ProductDetailedDescription")
    moreText = more.text if more else ''
    #price
    spans = browser.find_elements_by_css_selector("div.GC20.ProductPrice span")
    for i in range(0,len(spans),2):
        span = spans[i].text
        i+=1
        #print(span)
    #print(titleText)
    #print(manuText)
    #print(descText)
    #print(moreText)
    #print(imgurl.replace('..', ''))
    #print("\n")
Both times I've included print(x) just as a visual aid to show myself that it's "working".
You'll need Selenium, as the dropdown menus are generated through JavaScript.
Two suggestions: first, it takes some time for Selenium to load the page dynamically, so add a time.sleep to allow for this. Second, the xpath syntax required a small change:
import time
browser.get("https://www.mcavoyguns.co.uk/contents/en-uk/d130_Beretta_Over___Under_Competeition_shotguns.html")
time.sleep(2)
dropd = browser.find_element_by_xpath("//select[contains(@name, 'Barrel')]")
Output of print(dropd.text):
Barrel length *
28"
30"
32"
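The "name contains" match can also be checked offline, without a browser, against a saved copy of the rendered HTML. The following is only a minimal sketch using the standard library's html.parser; the class name SelectOptionText, the cut-down HTML string, and the second select ("_P889O2Choke") are made up for illustration and are not taken from the site.

```python
from html.parser import HTMLParser


class SelectOptionText(HTMLParser):
    """Collect option text from <select> elements whose name contains a substring."""

    def __init__(self, name_part):
        super().__init__()
        self.name_part = name_part
        self.in_select = False   # inside a matching <select>
        self.in_option = False   # inside an <option> of a matching <select>
        self.options = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "select" and self.name_part in (attrs.get("name") or ""):
            self.in_select = True
        elif tag == "option" and self.in_select:
            self.in_option = True
            self.options.append("")

    def handle_endtag(self, tag):
        if tag == "select":
            self.in_select = False
        elif tag == "option":
            self.in_option = False

    def handle_data(self, data):
        if self.in_option:
            self.options[-1] += data


# Hypothetical, simplified markup; in practice this would be browser.page_source.
html = (
    '<select name="_P889O1Barrel length">'
    '<option value="">Barrel length *</option>'
    '<option value="a">28&quot;</option>'
    '<option value="b">30&quot;</option>'
    '</select>'
    '<select name="_P889O2Choke"><option value="c">Fixed</option></select>'
)

parser = SelectOptionText("Barrel")
parser.feed(html)
parser.close()
print(parser.options)
```

Only the select whose name contains "Barrel" contributes options; the other one is ignored, which mirrors what contains(@name, 'Barrel') does in the XPath above.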
Alternatively, you can use BeautifulSoup in combination with Selenium:
import time
import re
browser.get("https://www.mcavoyguns.co.uk/contents/en-uk/d130_Beretta_Over___Under_Competeition_shotguns.html")
time.sleep(2)
soup = BeautifulSoup(browser.page_source, 'html.parser')
barrels = soup.find_all('select', attrs={'name': re.compile('length')})
[[x['origvalue'][:2] for x in i.find_all('option')[1:]] for i in barrels]
To make it fit your full code:
browser.get("https://www.mcavoyguns.co.uk/contents/en-uk/d130_Beretta_Over___Under_Competeition_shotguns.html")
time.sleep(2)
soup = BeautifulSoup(browser.page_source, 'html.parser')
products = soup.find_all("div", "GC62 Product")
for product in products:
    #barrel lengths
    barrels = product.find('select', attrs={'name': re.compile('length')})
    if barrels:
        barrels_list = [x['origvalue'][:2] for x in barrels.find_all('option')[1:]]
        print(barrels_list)
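Slicing with [:2] works for the two-digit lengths shown here, but it would cut off a longer label. As a hedged alternative, the origvalue string could be split on its field separator instead. The sketch below assumes the separator is "~|`", inferred from the sample HTML (the actual value of core.str_sep1 is not shown in the question), and that field 7 is the choice ID, as suggested by the split(...)[7] call in the onchange handler. The helper name parse_option_value is made up for illustration.

```python
# Separator inferred from the sample HTML; an assumption, since the real
# value of core.str_sep1 is not shown.
SEP = "~|`"


def parse_option_value(raw):
    """Split one option's value/origvalue string into the fields of interest."""
    fields = raw.split(SEP)
    return {
        "label": fields[0],                                  # e.g. 28"
        "choice_id": fields[7] if len(fields) > 7 else "",   # e.g. P889O1C1
    }


# origvalue of the first real option in the question's HTML
sample = '28"~|`0~|`0.00~|`50220~|`0.000000~|`0.000~|`~|`P889O1C1'
parsed = parse_option_value(sample)
print(parsed["label"].rstrip('"'), parsed["choice_id"])
```

In the loop above, parse_option_value(x['origvalue'])["label"] could then stand in for x['origvalue'][:2].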