I'm in the process of trying to scrape reddit (API-free) and I've run into a brick wall. On reddit, every page has a JSON representation that can be seen simply by appending .json to the end, e.g. https://www.reddit.com/r/AskReddit.json.
I installed NeatJSON and wrote a small chunk of code to clean the JSON up and print it:
require "rubygems"
require "json"
require "net/http"
require "uri"
require 'open-uri'
require 'neatjson'
url = ("https://www.reddit.com/r/AskReddit.json")
# URI.open comes from open-uri; plain open(url) no longer fetches URLs on Ruby 3.0+
result = JSON.parse(URI.open(url).read)
neatJS = JSON.neat_generate(result, wrap: 40, short: true, sorted: true, aligned: true, aroundColonN: 1)
puts neatJS
And it works fine:
(There's way more to that, it goes on for another few pages, the full JSON is here: http://pastebin.com/HDzFXqyU)
However, when I changed it to extract only the values I want:
url = ("https://www.reddit.com/r/AskReddit.json")
result = JSON.parse(URI.open(url).read)
neatJS = JSON.neat_generate(result, wrap: 40, short: true, sorted: true, aligned: true, aroundColonN: 1)
neatJS.each do |data|
puts data["title"]
puts data["url"]
puts data["id"]
end
It gave me an error:
002----extractallaskredditthreads.rb:17:in `<main>': undefined method `each' for #<String:0x0055f948da9ae8> (NoMethodError)
I've been trying different variations of the extractor for about two days and none of them have worked. I feel like I'm missing something incredibly obvious. If anyone could point out what I'm doing wrong, that would be appreciated.
EDIT
It turns out I had the wrong variable name:
neatSJ =/= neatJS
However, correcting this only changes the error I got:
002----extractallaskredditthreads.rb:17:in `<main>': undefined method `each' for #<String:0x0055f948da9ae8> (NoMethodError)
And as I said, I have been attempting multiple ways of extracting the tags, which may have caused my typo.
In this code:
result = JSON.parse(URI.open(url).read)
neatJS = JSON.neat_generate(result, wrap: 40, short: true, sorted: true, aligned: true, aroundColonN: 1)
...result is a Ruby Hash object, the result of parsing the JSON into a Ruby object with JSON.parse. Meanwhile, neatJS is a String, the result of calling JSON.neat_generate on the result Hash. It doesn't make sense to call each on a string. If you want to access the values inside the JSON structure, you want to use the result object, not the neatJS string:
children = result["data"]["children"]
children.each do |child|
puts child["data"]["title"]
puts child["data"]["url"]
puts child["data"]["id"]
end
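If you want to try this without hitting reddit at all, here is a minimal offline sketch of the same idea. The SAMPLE data is made up and heavily trimmed (the real listing has many more fields per post), extract_posts is a hypothetical helper name, and the standard library's JSON.pretty_generate stands in for JSON.neat_generate so that no extra gem is needed:

```ruby
require "json"

# Hypothetical, trimmed stand-in for the listing JSON (assumption: only the
# nesting data -> children -> data matters for this example).
SAMPLE = <<~JSON
  {"kind": "Listing",
   "data": {"children": [
     {"kind": "t3", "data": {"id": "abc123", "title": "First post", "url": "https://example.com/1"}},
     {"kind": "t3", "data": {"id": "def456", "title": "Second post", "url": "https://example.com/2"}}
   ]}}
JSON

# Hypothetical helper: returns [title, url, id] triples from a parsed listing.
# Falls back to an empty list if the expected keys are missing.
def extract_posts(listing)
  children = listing.dig("data", "children") || []
  children.map do |child|
    data = child.fetch("data", {})
    [data["title"], data["url"], data["id"]]
  end
end

result = JSON.parse(SAMPLE)            # a Hash: this is what you iterate over
pretty = JSON.pretty_generate(result)  # a String: only useful for printing

puts result.class                # Hash
puts pretty.class                # String
puts pretty.respond_to?(:each)   # false, which is why the NoMethodError appears

extract_posts(result).each do |title, url, id|
  puts "#{id}: #{title} (#{url})"
end
```

Keeping the parsed Hash for data access and the generated String purely for display avoids mixing the two up again. Using dig and fetch with defaults is just one way to make the loop tolerate a listing that is missing keys.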