I'm writing a Ruby web scraper. In the HTML of the page I want to scrape, each article contains up to three spans that share the same class (item-detail):
<article>
<div class="item item_contains_branding" data-adid="1234567">
<div class="clearfix" style="display: block;">
<div class="item-multimedia ">
...
</div>
<div class="item-info-container">
<div class="logo-branding">
...
</div>
<a href="/link/1" class="item-link " title="title 1" data-xiti-click="listado::enlace">title 1</a>
<div class="row price-row clearfix"> <span class="item-price">200<span>€</span></span> </div>
<span class="item-detail">T2 <small></small></span> <span class="item-detail">20 <small>m²</small></span> <span class="item-detail"> <small> more details 1</small></span>
<p class="item-description">description...</p>
<div class="item-toolbar clearfix">
...
</div>
</div>
</div>
</div>
</article>
<article>
<div class="item item_contains_branding" data-adid="1234567">
<div class="clearfix" style="display: block;">
<div class="item-multimedia ">
...
</div>
<div class="item-info-container">
<div class="logo-branding">
...
</div>
<a href="/link/2" class="item-link " title="title 2" data-xiti-click="listado::enlace">title 2</a>
<div class="row price-row clearfix"> <span class="item-price">300<span>€</span></span> </div>
<span class="item-detail">T5 <small></small></span> <span class="item-detail">50 <small>m²</small></span>
<p class="item-description">description...</p>
<div class="item-toolbar clearfix">
...
</div>
</div>
</div>
</div>
</article>
<article>
<div class="item item_contains_branding" data-adid="1234567">
<div class="clearfix" style="display: block;">
<div class="item-multimedia ">
...
</div>
<div class="item-info-container">
<div class="logo-branding">
...
</div>
<a href="/link/3" class="item-link " title="title 3" data-xiti-click="listado::enlace">title 3</a>
<div class="row price-row clearfix"> <span class="item-price">500<span>€</span></span> </div>
<span class="item-detail">T1 <small></small></span> <span class="item-detail">100 <small>m²</small></span> <span class="item-detail"> <small> more details 3</small></span>
<p class="item-description">description...</p>
<div class="item-toolbar clearfix">
...
</div>
</div>
</div>
</div>
</article>
However, some of the articles don't have the last span (the one with "more details").
For now, I have been using this code:
# First loop: print the titles
page.css('a.item-link').each do |line|
  puts line.text
end
# Second loop: print the prices
page.css('span.item-price').each do |line|
  puts line.text
end
# Third loop: print the details
page.css('span.item-detail').each do |line|
  puts line.text
end
I'm using the Nokogiri gem and open-uri to retrieve and parse the file.
How can I group the three spans per article (some articles only have two spans with the "item-detail" class) and print them on the screen?
My desired output is:
title 1
title 2
title 3
200€
300€
500€
T2
T5
T1
20 m²
50 m²
100 m²
more details 1
" "
more details 3
Some of the articles don't have the third span (with "more details n"); in that case I want to print " " instead. My goal is to write the results to a .csv file.
This code works for the sample input, although I had to wrap the input in a single root element (<document>) so that it parses properly as XML:
require "nokogiri"
html = <<HTML
<document>
<article>
<div class="item item_contains_branding" data-adid="1234567">
<div class="clearfix" style="display: block;">
<div class="item-multimedia ">
...
</div>
<div class="item-info-container">
<div class="logo-branding">
...
</div>
<a href="/link/1" class="item-link " title="title 1" data-xiti-click="listado::enlace">title 1</a>
<div class="row price-row clearfix"> <span class="item-price">200<span>€</span></span> </div>
<span class="item-detail">T2 <small></small></span> <span class="item-detail">20 <small>m²</small></span> <span class="item-detail"> <small> more details 1</small></span>
<p class="item-description">description...</p>
<div class="item-toolbar clearfix">
...
</div>
</div>
</div>
</div>
</article>
<article>
<div class="item item_contains_branding" data-adid="1234567">
<div class="clearfix" style="display: block;">
<div class="item-multimedia ">
...
</div>
<div class="item-info-container">
<div class="logo-branding">
...
</div>
<a href="/link/2" class="item-link " title="title 2" data-xiti-click="listado::enlace">title 2</a>
<div class="row price-row clearfix"> <span class="item-price">300<span>€</span></span> </div>
<span class="item-detail">T5 <small></small></span> <span class="item-detail">50 <small>m²</small></span>
<p class="item-description">description...</p>
<div class="item-toolbar clearfix">
...
</div>
</div>
</div>
</div>
</article>
<article>
<div class="item item_contains_branding" data-adid="1234567">
<div class="clearfix" style="display: block;">
<div class="item-multimedia ">
...
</div>
<div class="item-info-container">
<div class="logo-branding">
...
</div>
<a href="/link/3" class="item-link " title="title 3" data-xiti-click="listado::enlace">title 3</a>
<div class="row price-row clearfix"> <span class="item-price">500<span>€</span></span> </div>
<span class="item-detail">T1 <small></small></span> <span class="item-detail">100 <small>m²</small></span> <span class="item-detail"> <small> more details 3</small></span>
<p class="item-description">description...</p>
<div class="item-toolbar clearfix">
...
</div>
</div>
</div>
</div>
</article>
</document>
HTML
# Parse as XML; the <document> wrapper provides the single root element.
page = Nokogiri::XML(html)
articles = page.css('article')
# Titles, read from each link's title attribute
articles.each do |article|
  article.css('a.item-link').each do |link|
    puts link[:title]
  end
end
# Prices
articles.each do |article|
  article.css('span.item-price').each do |price|
    puts price.text
  end
end
# First item-detail span of each article
articles.each do |article|
  detail_spans = article.css('span.item-detail')
  puts detail_spans[0].text
end
# Second item-detail span of each article
articles.each do |article|
  detail_spans = article.css('span.item-detail')
  puts detail_spans[1].text
end
# Third item-detail span, or a quoted blank (" ") when the span is missing
articles.each do |article|
  detail_spans = article.css('span.item-detail')
  puts(detail_spans[2] ? detail_spans[2].text.strip : ' '.inspect)
end
This code retrieves an array of the article elements, and then uses each article element in the array to scope additional queries to the elements it contains. This makes it possible to report individual element values per article.
The final item-detail query checks whether the element exists before reading it, so that a placeholder is printed when the span is missing. Other queries may need the same technique, depending on the actual contents of the HTML document.
These are the results:
title 1
title 2
title 3
200€
300€
500€
T2
T5
T1
20 m²
50 m²
100 m²
more details 1
" "
more details 3
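Since the stated goal is a .csv file, it may be easier to emit one row per article rather than one column at a time. The sketch below uses Ruby's standard csv library with the values from the results above hard-coded as rows; the header names are illustrative assumptions, not taken from the page.

```ruby
require "csv"

# Rows copied from the results above; in the scraper they would be
# built from the per-article queries instead of being hard-coded.
rows = [
  ["title 1", "200€", "T2", "20 m²", "more details 1"],
  ["title 2", "300€", "T5", "50 m²", " "],
  ["title 3", "500€", "T1", "100 m²", "more details 3"]
]

# Header names are assumptions chosen for illustration.
headers = %w[title price type area details]

# Build the CSV text in memory: header line first, then one line per article.
csv_text = CSV.generate do |csv|
  csv << headers
  rows.each { |row| csv << row }
end

puts csv_text
```

To save the result, the generated string can be passed to `File.write` with whatever file name suits the project.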