I'm scraping TechCrunch.com to grab the title, URL, and preview text of each article.
Here is what I have:
require 'nokogiri'
require 'open-uri'
class TestScraper::Scraper
  def initialize(url)
    # URI.open is provided by open-uri; a bare open(url) no longer works on Ruby 3+
    @doc = Nokogiri::HTML(URI.open(url))
  end
  def scrape_tech_crunch
    articles = @doc.css("h2.post-block__title").css("a")
    top_stories = articles.each do |story|
      stories = {
        :title => story.children.text.strip,
        :url => story.attribute("href").value,
        :preview => @doc.css("div.post-block__content").children.first.text
      }
      TestScraper::Article.new(stories)
    end
  end
end
TestScraper::Article.new(stories)
takes the hash as an argument and uses it to initialize a new Article instance:
class TestScraper::Article
  attr_accessor :title, :url, :preview
  @@all = []
  def initialize(hash)
    hash.each do |k, v|
      self.send "#{k}=", v
    end
    @@all << self
  end
  def self.all
    @@all
  end
end
When I run TestScraper::Scraper.new("https://techcrunch.com").scrape_tech_crunch
I get:
[#<TestScraper::Article:0x00000000015f69e0
@preview=
"\n\t\tSecurity researchers have found dozens of Android apps in the Google Play store serving ads to unsuspecting victims as part of a money-making scheme. ESET researchers found 42 apps containing adware, \t",
@title=
"Millions downloaded dozens of Android apps on Google Play infected with adware",
@url=
"https://techcrunch.com/2019/10/24/millions-dozens-android-apps-adware/">,
#<TestScraper::Article:0x00000000015f5658
@preview=
"\n\t\tSecurity researchers have found dozens of Android apps in the Google Play store serving ads to unsuspecting victims as part of a money-making scheme. ESET researchers found 42 apps containing adware, \t",
@title="Netflix launches $4 mobile-only monthly plan in Malaysia",
@url=
"https://techcrunch.com/2019/10/24/netflix-malaysia-mobile-only-cheap-plan/">
It creates an object with the appropriate title and URL for each instance of the article class, but it keeps assigning the same preview text to each article instance. There should be 20 articles each with its own "preview" (the small sample of the article you get before you click on the link to read the full article).
The problem is that
@doc.css("div.post-block__content").children.first.text
selects the same node for every story, because you call it on @doc,
which is the whole document rather than the story you are currently iterating over.
Instead, select the closest common ancestor of each story's title and preview, and navigate down from there:
@doc.css('.post-block').map do |story|
  # navigate down from the selected node
  title = story.at_css('h2.post-block__title a')
  preview = story.at_css('div.post-block__content')
  TestScraper::Article.new(
    title: title.content.strip,
    url: title['href'], # the key must match the :url accessor on Article
    preview: preview.content.strip
  )
end
If any of the methods used here are unfamiliar, have a look at the Nokogiri cheat sheet. If you still have questions after that, feel free to ask in the comments.