Web Scraping with Nokogiri::HTML and Ruby - How to get output into an array?


I've just started using Nokogiri to scrape info from a site and can't figure out how to get the following done. I have some HTML I want to scrape:

    <div class="compatible_vehicles">
    <div class="heading">
    <h3>Compatible Vehicles</h3>
    </div><!-- .heading -->
    <ul>
            <li>
        <p class="label">Type1</p>
        <p class="data">All</p>
    </li>
    <li>
        <p class="label">Type2</p>
      <p class="data">All</p>
    </li>
    <li>
        <p class="label">Type3</p>
      <p class="data">All</p>
    </li>
    <li>
        <p class="label">Type4</p>
      <p class="data">All</p>
    </li>
    <li>
        <p class="label">Type5</p>
      <p class="data">All</p>
    </li>
    </ul>
    </div><!-- .compatible_vehicles -->

And I've managed to get the output I want on my screen with this:

    doc.css('div > .compatible_vehicles > ul > li').each do |item|
      label = item.at_css(".label").text
      data = item.at_css(".data").text
      print "#{label} - #{data}" + ','
    end

This gives me a list like this: Type1 - All,Type2 - All,Type3 - All,Type4 - All,Type5 - All, on the screen.

Now I want to get these values into an array so I can save them to a CSV file. I've tried a few things, but most attempts give me a 'Can't convert String to Array' error. Hope someone can help me out with this!


Solution

  • Starting with the HTML:

    html = '
    <div class="compatible_vehicles">
        <div class="heading">
            <h3>Compatible Vehicles</h3>
        </div><!-- .heading -->
        <ul>
            <li>
            <p class="label">Type1</p>
            <p class="data">All</p>
            </li>
            <li>
            <p class="label">Type2</p>
            <p class="data">All</p>
            </li>
            <li>
            <p class="label">Type3</p>
            <p class="data">All</p>
            </li>
            <li>
            <p class="label">Type4</p>
            <p class="data">All</p>
            </li>
            <li>
            <p class="label">Type5</p>
            <p class="data">All</p>
            </li>
        </ul>
    </div><!-- .compatible_vehicles -->
    '
    

    Parsing it with Nokogiri and looping over the <li> tags to get their <p> tag contents:

    require 'nokogiri'
    
    doc = Nokogiri::HTML(html)
    data = doc.search('.compatible_vehicles li').map{ |li|
      li.search('p').map { |p| p.text }
    }
    

    Returns an array of arrays:

    => [["Type1", "All"], ["Type2", "All"], ["Type3", "All"], ["Type4", "All"], ["Type5", "All"]]
    

    From there you should be able to plug that into the examples for the CSV class and get it to work with no trouble.
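As a rough sketch of that last step, the array of arrays can be handed to Ruby's standard CSV library row by row. The sample rows are copied from the output above so the snippet runs without Nokogiri; the header names `label` and `data` and the file name `vehicles.csv` are assumptions chosen for illustration, not something from the original page.

```ruby
require 'csv'

# Sample rows in the same shape as the array of arrays returned above.
data = [["Type1", "All"], ["Type2", "All"], ["Type3", "All"]]

# CSV.generate builds the CSV text in memory; each row is appended with <<.
csv_text = CSV.generate do |csv|
  csv << ["label", "data"] # header row; these column names are an assumption
  data.each { |row| csv << row }
end

puts csv_text

# To write to disk instead, CSV.open takes a path and a mode
# (the file name here is hypothetical):
# CSV.open("vehicles.csv", "w") { |csv| data.each { |row| csv << row } }
```

Because each inner array is already one row, no string concatenation is involved, which sidesteps the "Can't convert String to Array" kind of error.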

    Now, compare your code for printing the fields to the screen with this:

    data.map{ |a| a.join(' - ') }.join(', ')
    => "Type1 - All, Type2 - All, Type3 - All, Type4 - All, Type5 - All"
    

    All I'd have to do is pass that string to puts and it'd print correctly.

    It's really important to think about returning useful data structures. In Ruby, hashes and arrays are very useful, because we can iterate over them and massage them into many forms. It'd be trivial, from the array of arrays, to create a hash:

    Hash[data]
    => {"Type1"=>"All", "Type2"=>"All", "Type3"=>"All", "Type4"=>"All", "Type5"=>"All"}
    

    Which would make it really easy to do lookups.
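A minimal sketch of what those lookups might look like, again using rows copied from the output above. `Array#to_h` is used here as an equivalent of `Hash[data]`; the key `"Type9"` and the fallback value `"unknown"` are made up for the example.

```ruby
# Sample rows in the same shape as the array of arrays returned above.
data = [["Type1", "All"], ["Type2", "All"], ["Type3", "All"]]

# Array#to_h turns each [key, value] pair into a hash entry, like Hash[data].
lookup = data.to_h

# Plain [] returns nil when the key is missing.
type2 = lookup["Type2"]

# fetch with a default avoids nil for labels that were not on the page.
missing = lookup.fetch("Type9", "unknown")

puts type2
puts missing
puts lookup.key?("Type1")
```

One caveat with the hash form: if the page ever lists the same label twice, the later entry silently replaces the earlier one, so the array of arrays is the safer structure when duplicates matter.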