require 'open-uri'
require 'json'
require 'nokogiri'
doc = Nokogiri::HTML(URI.open("http://www.highcharts.com/demo/"))
puts doc
But I want to extract the JSON from this webpage. Regular expressions don't seem to work. How can I extract the JSON through XPath?
Here's how you can access the script tags (that don't reference an external file) from a URL:
require 'open-uri'
require 'nokogiri'
# URI.open is required on Ruby 3.0+, where Kernel#open no longer accepts URLs
doc = Nokogiri.HTML(URI.open('http://www.highcharts.com/demo/'))
inline_script = doc.xpath('//script[not(@src)]')
inline_script.each do |script|
  puts "-"*50, script.text
end
Now you just need to find the script block you want and extract the data from it (using a regex). Without more details, it's hard to guess which data you want or which parts of the page structure you can rely on.
Here's a fairly fragile regex that finds what I'm guessing you were looking for:
inline = doc.xpath('//script[not(@src)]').map(&:text)
data = inline.map{ |js| js[/new Highcharts\.Chart\((.+?\})\);/m,1] }.compact[0]
puts data
Here's what you get out:
{
  chart: {
    renderTo: 'container',
    defaultSeriesType: 'line',
    marginRight: 130,
    marginBottom: 25
  },
  title: {
    text: 'Monthly Average Temperature',
    x: -20 //center
  },
  subtitle: {
    text: 'Source: WorldClimate.com',
    x: -20
  },
  xAxis: {
    categories: ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
      'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
  },
  yAxis: {
    title: {
      text: 'Temperature (°C)'
    },
    plotLines: [{
      value: 0,
      width: 1,
      color: '#808080'
    }]
  },
  tooltip: {
    formatter: function() {
      return '<b>'+ this.series.name +'</b><br/>'+
        this.x +': '+ this.y +'°C';
    }
  },
  legend: {
    layout: 'vertical',
    align: 'right',
    verticalAlign: 'top',
    x: -10,
    y: 100,
    borderWidth: 0
  },
  series: [{
    name: 'Tokyo',
    data: [7.0, 6.9, 9.5, 14.5, 18.2, 21.5, 25.2, 26.5, 23.3, 18.3, 13.9, 9.6]
  }, {
    name: 'New York',
    data: [-0.2, 0.8, 5.7, 11.3, 17.0, 22.0, 24.8, 24.1, 20.1, 14.1, 8.6, 2.5]
  }, {
    name: 'Berlin',
    data: [-0.9, 0.6, 3.5, 8.4, 13.5, 17.0, 18.6, 17.9, 14.3, 9.0, 3.9, 1.0]
  }, {
    name: 'London',
    data: [3.9, 4.2, 5.7, 8.5, 11.9, 15.2, 17.0, 16.6, 14.2, 10.3, 6.6, 4.8]
  }]
}
Note that this is not JSON; this is a string representing JavaScript code with object, string, array, numeric, and function literals.
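Because of that, JSON.parse will reject the block as a whole (unquoted keys, single-quoted strings, a comment, a function). Individual pieces can still be valid JSON, though. As a rough sketch, assuming the numeric data arrays are all you need and that each series lists name immediately before data as in the output above, you could pull those pairs out with a regex and hand only the arrays to JSON.parse. The `series_from` helper and the `SAMPLE_JS` excerpt are hypothetical names for this illustration, and the regex is just as fragile as the one above.

```ruby
require 'json'

# Excerpt of the extracted JavaScript, shortened to two series.
SAMPLE_JS = <<~JS
  series: [{
    name: 'Tokyo',
    data: [7.0, 6.9, 9.5, 14.5, 18.2, 21.5, 25.2, 26.5, 23.3, 18.3, 13.9, 9.6]
  }, {
    name: 'New York',
    data: [-0.2, 0.8, 5.7, 11.3, 17.0, 22.0, 24.8, 24.1, 20.1, 14.1, 8.6, 2.5]
  }]
JS

# Builds a { series name => array of numbers } hash.
# Only the bracketed number list goes through JSON.parse; the single-quoted
# name is captured by the regex because it is not a valid JSON string.
def series_from(js)
  pairs = js.scan(/name:\s*'([^']*)',\s*data:\s*(\[[^\]]*\])/m)
  pairs.map { |name, array| [name, JSON.parse(array)] }.to_h
end

series = series_from(SAMPLE_JS)
puts series.keys.inspect
puts series['Tokyo'].max
```

If you need the whole object rather than a few fields, you would have to evaluate the code with a JavaScript engine instead of parsing it as JSON.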