
Ruby - Scraper concatenate strings


I'm making a Ruby web scraper to gather some info. In the HTML of the page that I want to scrape, each article contains up to three spans with the same class (item-detail):

<article>
   <div class="item item_contains_branding" data-adid="1234567">
      <div class="clearfix" style="display: block;">
         <div class="item-multimedia ">
            ...
         </div>
         <div class="item-info-container">
            <div class="logo-branding">
            ...
            </div>
                    <a href="/link/1" class="item-link " title="title 1" data-xiti-click="listado::enlace">title 1</a> 
            <div class="row price-row clearfix"> <span class="item-price">200<span>€</span></span> </div>
            <span class="item-detail">T2 <small></small></span> <span class="item-detail">20 <small>m²</small></span> <span class="item-detail"> <small> more details 1</small></span> 
                <p class="item-description">description...</p>
            <div class="item-toolbar clearfix">
            ...
            </div>
         </div>
      </div>
   </div>
</article>
<article>
   <div class="item item_contains_branding" data-adid="1234567">
      <div class="clearfix" style="display: block;">
         <div class="item-multimedia ">
            ...
         </div>
         <div class="item-info-container">
            <div class="logo-branding">
            ...
            </div>
                    <a href="/link/2" class="item-link " title="title 2" data-xiti-click="listado::enlace">title 2</a> 
            <div class="row price-row clearfix"> <span class="item-price">300<span>€</span></span> </div>
            <span class="item-detail">T5 <small></small></span> <span class="item-detail">50 <small>m²</small></span>
                <p class="item-description">description...</p>
            <div class="item-toolbar clearfix">
            ...
            </div>
         </div>
      </div>
   </div>
</article>
<article>
   <div class="item item_contains_branding" data-adid="1234567">
      <div class="clearfix" style="display: block;">
         <div class="item-multimedia ">
            ...
         </div>
         <div class="item-info-container">
            <div class="logo-branding">
            ...
            </div>
                    <a href="/link/3" class="item-link " title="title 3" data-xiti-click="listado::enlace">title 3</a> 
            <div class="row price-row clearfix"> <span class="item-price">500<span>€</span></span> </div>
            <span class="item-detail">T1 <small></small></span> <span class="item-detail">100 <small>m²</small></span> <span class="item-detail"> <small> more details 3</small></span> 
                <p class="item-description">description...</p>
            <div class="item-toolbar clearfix">
            ...
            </div>
         </div>
      </div>
   </div>
</article>

However, some of the articles don't have the last span (the one with "more details").

For now, I have been using this code:

# First loop: print the titles
page.css('a.item-link').each do |line|
    puts line.text
end
# Second loop: print the prices
page.css('span.item-price').each do |line|
    puts line.text
end
# Third loop: print the details
page.css('span.item-detail').each do |line|
    puts line.text
end

I'm using the Nokogiri gem and open-uri to retrieve and parse the file.

How can I concatenate the 3 spans (some articles only have two spans with the "item-detail" class) and print them on the screen?

My desired output is:

title 1
title 2
title 3
200€
300€
500€
T2
T5
T1
20 m²
50 m²
100 m²
more details 1
" "
more details 3

Some of the articles don't have the third span (with "more details n"); in that case I want to print " " instead. My goal is to write the results to a .csv file.


Solution

  • This code works for the sample input. I had to wrap the sample in a single root element (<document>) so that Nokogiri::XML could parse it as a well-formed document:

    require "nokogiri"
    
    html = <<HTML
    <document>
    <article>
       <div class="item item_contains_branding" data-adid="1234567">
          <div class="clearfix" style="display: block;">
             <div class="item-multimedia ">
                ...
             </div>
             <div class="item-info-container">
                <div class="logo-branding">
                ...
                </div>
                        <a href="/link/1" class="item-link " title="title 1" data-xiti-click="listado::enlace">title 1</a>
                <div class="row price-row clearfix"> <span class="item-price">200<span>€</span></span> </div>
                <span class="item-detail">T2 <small></small></span> <span class="item-detail">20 <small>m²</small></span> <span class="item-detail"> <small> more details 1</small></span>
                    <p class="item-description">description...</p>
                <div class="item-toolbar clearfix">
                ...
                </div>
             </div>
          </div>
       </div>
    </article>
    <article>
       <div class="item item_contains_branding" data-adid="1234567">
          <div class="clearfix" style="display: block;">
             <div class="item-multimedia ">
                ...
             </div>
             <div class="item-info-container">
                <div class="logo-branding">
                ...
                </div>
                        <a href="/link/2" class="item-link " title="title 2" data-xiti-click="listado::enlace">title 2</a>
                <div class="row price-row clearfix"> <span class="item-price">300<span>€</span></span> </div>
                <span class="item-detail">T5 <small></small></span> <span class="item-detail">50 <small>m²</small></span>
                    <p class="item-description">description...</p>
                <div class="item-toolbar clearfix">
                ...
                </div>
             </div>
          </div>
       </div>
    </article>
    <article>
       <div class="item item_contains_branding" data-adid="1234567">
          <div class="clearfix" style="display: block;">
             <div class="item-multimedia ">
                ...
             </div>
             <div class="item-info-container">
                <div class="logo-branding">
                ...
                </div>
                        <a href="/link/3" class="item-link " title="title 3" data-xiti-click="listado::enlace">title 3</a>
                <div class="row price-row clearfix"> <span class="item-price">500<span>€</span></span> </div>
                <span class="item-detail">T1 <small></small></span> <span class="item-detail">100 <small>m²</small></span> <span class="item-detail"> <small> more details 3</small></span>
                    <p class="item-description">description...</p>
                <div class="item-toolbar clearfix">
                ...
                </div>
             </div>
          </div>
       </div>
    </article>
    </document>
    HTML
    
    page = Nokogiri::XML(html)
    articles = page.css('article')
    
    # Titles, taken from the link's title attribute
    articles.each do |article|
      article.css('a.item-link').each do |link|
        puts link[:title]
      end
    end
    
    # Prices
    articles.each do |article|
      article.css('span.item-price').each do |price|
        puts price.text
      end
    end
    
    # First detail span
    articles.each do |article|
      detail_spans = article.css('span.item-detail')
      puts detail_spans[0].text
    end
    
    # Second detail span
    articles.each do |article|
      detail_spans = article.css('span.item-detail')
      puts detail_spans[1].text
    end
    
    # Third detail span, which may be missing; print " " (quoted) in that case
    articles.each do |article|
      detail_spans = article.css('span.item-detail')
      puts detail_spans[2] ? detail_spans[2].text.strip : ' '.inspect
    end
    
    

    This code retrieves an array of the article elements, and then uses each article element to scope the queries for the elements it contains. That makes it possible to report individual values per article.

    The final item-detail query checks whether the third span exists before reading it, so that articles without it print the placeholder instead of raising an error. Other queries may need the same check, depending on the actual contents of the HTML document.

    These are the results:

    title 1
    title 2
    title 3
    200€
    300€
    500€
    T2 
    T5 
    T1 
    20 m²
    50 m²
    100 m²
    more details 1
    " "
    more details 3