I am working on a javascript capable screen-scraper using capybara/dsl, selienium webdriver, and the spreadsheet gem. Very close to the desired output however two major problems arise:
I have not been able to figure out the exact xpath selector to filter out only the elements I'm looking for; to ensure that none are missing I am using a broad selector that I know will produce duplicate elements. I was planning on just calling .uniq on that selector but this throws an error. What is the proper way to do this results in the desired filtering. The error is an undefined no method for 'uniq'. Maybe I'm not using it properly: results = all("//a[contains(@onclick, 'analyticsLog')]").uniq
. I know that the xpath that I have chosen to extract hrefs: //a[contains(@onclick, 'analyticsLog')]
will define more nodes than I intended because using find to inspect the page elements shows 144 rather than 72 that make up the page results. I have looked for a more specific selector however I haven't been able to find one without filtering out some desired links due to the business logic used on the site.
My save_item method has two selectors that are not always found within the info results, I would like the script to just skip those that aren't found and save only the ones that are however my current iteration will throw a Capybara::ElementNotFound and exit. How could I configure this to work in the intended way.
code below
require "capybara/dsl"
require "spreadsheet"
Capybara.run_server = false
Capybara.default_driver = :selenium
Capybara.default_selector = :xpath
Spreadsheet.client_encoding = 'UTF-8'
class Tomtop
include Capybara::DSL
def initialize
@excel = Spreadsheet::Workbook.new
@work_list = @excel.create_worksheet
@row = 0
end
def go
visit_main_link
end
def visit_main_link
visit "http://www.some.com/clothing-accessories?dir=asc&limit=72&order=position"
results = all("//a[contains(@onclick, 'analyticsLog')]")# I would like to use .uniq here to filter out the duplicates that I know will be delivered by this selector
item = []
results.each do |a|
item << a[:href]
end
item.each do |link|
visit link
save_item
end
@excel.write "inventory.csv"
end
def save_item
data = all("//*[@id='content-wrapper']/div[2]/div/div")
data.each do |info|
@work_list[@row, 0] = info.find("//*[@id='productright']/div/div[1]/h1").text
@work_list[@row, 1] = info.find("//div[contains(@class, 'price font left')]").text
@work_list[@row, 2] = info.find("//*[@id='productright']/div/div[11]").text
@work_list[@row, 3] = info.find("//*[@id='tabcontent1']/div/div").text.strip
@work_list[@row, 4] = info.find("//select[contains(@name, 'options[747]')]//*[@price='0']").text #I'm aware that this will not always be found depending on the item in question but how do I ensure that it doesn't crash the program
@work_list[@row, 5] = info.find("//select[contains(@name, 'options[748]')]//*[@price='0']").text #I'm aware that this will not always be found depending on the item in question but how do I ensure that it doesn't crash the program
@row = @row + 1
end
end
end
tomtop = Tomtop.new
tomtop.go
For Question 1: Get unique elements
All of the elements returned by all
are unique. Therefore, I assume by "unique" elements, you mean that the "onclick" attribute is unique.
The collection of elements returned by Capybara is an enumerable. Therefore, you can convert it to an array and then take the unique element's based on their onclick attribute:
results = all("//a[contains(@onclick, 'analyticsLog')]")
.to_a.uniq{ |e| e[:onclick] }
Note that it looks like the duplicate links are due to one for the image and one for the text below the image. You could scope your search to just one or the other and then you would not need to do the uniq check. To scope to just the text link, use the fact that the link is a child of an h5:
results = all("//h5/a[contains(@onclick, 'analyticsLog')]")
For Question 2: Capture text if element present
To solve your second problem, you could use first
to locate the element. This will return the matching element if one exists and nil if one does not. You could then save the text if the element is found.
For example:
el = info.first("//select[contains(@name, 'options[747]')]//*[@price='0']")
@work_list[@row, 4] = el.text if el
If you want the text of all matching elements, then use all
:
options = info.all(".//select[contains(@name, 'options[747]')]//*[@price='0']")
@work_list[@row, 4] = options.collect(&:text).join(', ')
When there are multiple matching options, you will get something like "Green, Pink". If there are no matching options, you will get "".