How to assign a Nokogiri Element to hash key


I'm scraping TechCrunch.com and grabbing the title, URL, and preview text of each article.

I have:

require 'nokogiri'
require 'open-uri'

class TestScraper::Scraper
  def initialize(url)
    # URI.open is provided by open-uri; Ruby 3+ no longer lets bare open fetch URLs
    @doc = Nokogiri::HTML(URI.open(url))
  end

  def scrape_tech_crunch
    articles = @doc.css("h2.post-block__title").css("a")
    top_stories = articles.each do |story|
      stories = {
        :title => story.children.text.strip,
        :url => story.attribute("href").value,
        :preview => @doc.css("div.post-block__content").children.first.text
      }
      TestScraper::Article.new(stories)
    end
  end
end

TestScraper::Article.new(stories) takes the hash as an argument and uses it to initialize the Article class:

class TestScraper::Article
  attr_accessor :title, :url, :preview 

  @@all = []

  def initialize(hash)
    hash.each do |k, v|
      self.send "#{k}=", v
    end
    @@all << self
  end

  def self.all
    @@all
  end
end

When I run TestScraper::Scraper.new("https://techcrunch.com").scrape_tech_crunch I get:

[#<TestScraper::Article:0x00000000015f69e0
  @preview=
   "\n\t\tSecurity researchers have found dozens of Android apps in the Google Play store serving ads to unsuspecting victims as part of a money-making scheme. ESET researchers found 42 apps containing adware, \t",
  @title=
   "Millions downloaded dozens of Android apps on Google Play infected with adware",
  @url=
   "https://techcrunch.com/2019/10/24/millions-dozens-android-apps-adware/">,
 #<TestScraper::Article:0x00000000015f5658
  @preview=
   "\n\t\tSecurity researchers have found dozens of Android apps in the Google Play store serving ads to unsuspecting victims as part of a money-making scheme. ESET researchers found 42 apps containing adware, \t",
  @title="Netflix launches $4 mobile-only monthly plan in Malaysia",
  @url=
   "https://techcrunch.com/2019/10/24/netflix-malaysia-mobile-only-cheap-plan/">

It creates an Article instance with the correct title and URL for each story, but it assigns the same preview text to every instance. There should be 20 articles, each with its own "preview" (the short excerpt you see before you click through to the full article).


Solution

  • The problem is that

    @doc.css("div.post-block__content").children.first.text
    

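    selects the same node for every story, because you call it on @doc, which is the whole document: css returns every matching div on the page, and children.first then always picks the first child of the first one.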

    Instead, find the closest common ancestor of each story's elements and navigate down from there:

    @doc.css('.post-block').map do |story|
      # navigate down from the selected node
      title   = story.at_css('h2.post-block__title a')
      preview = story.at_css('div.post-block__content')
    
      TestScraper::Article.new(
        title:   title.content.strip,
        url:     title['href'],  # key must match attr_accessor :url
        preview: preview.content.strip
      )
    end
    

    If any of the methods used here are unclear, have a look at the Nokogiri cheat sheet. If you still have questions after that, feel free to ask in the comments.