ruby, web-scraping, nokogiri

Scraping the href value of an anchor in Ruby


I'm working on a project where I have to scrape a "website," which is just an HTML file in one of my local folders. I've been trying to scrape down to the href value (a URL) of the anchor tag for each student object. I'm also scraping for other things, so ignore the rest. Here is what I have so far:

def self.scrape_index_page(index_url) #responsible for scraping the index page that lists all of the students
    #return an array of hashes in which each hash represents one student.
    html = index_url
    doc = Nokogiri::HTML(open(html))
    # doc.css(".student-name").first.text
    # doc.css(".student-location").first.text
    #student_card = doc.css(".student-card").first
    #student_card.css("a").text
end

Here is one of the student cards. They all have the same structure, so I'm just interested in scraping the href URL value from each one.

<div class="student-card" id="eric-chu-card">
   <a href="students/eric-chu.html">
      <div class="view-profile-div">
         <h3 class="view-profile-text">View Profile</h3>
      </div>
      <div class="card-text-container">
         <h4 class="student-name">Eric Chu</h4>
         <p class="student-location">Glenelg, MD</p>
      </div>
   </a>
</div>

Thanks for your help!


Solution

  • Once you get an anchor tag in Nokogiri, you can get the href like this:

    anchor["href"]
    

    So in your example, you could get the href by doing the following:

    student_card = doc.css(".student-card").first
    href = student_card.css("a").first["href"]
    

    If you wanted to collect all of the href values at once, you could do something like this:

    hrefs = doc.css(".student-card a").map { |anchor| anchor["href"] }