ruby nokogiri open-uri

Extract some JSON using Nokogiri


require 'open-uri'
require 'json'
require 'nokogiri'

doc = Nokogiri::HTML(URI.open("http://www.highcharts.com/demo/"))

puts doc

But I want to extract the JSON from this web page. Regular expressions don't seem to work, so how can I extract the JSON using XPath?


Solution

  • Here's how you can access the inline script tags (those that don't reference an external file) on a page fetched from a URL:

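    require 'open-uri'
    require 'nokogiri'

    # URI.open is needed on Ruby 3.0+, where Kernel#open no longer fetches URLs
    doc = Nokogiri.HTML(URI.open('http://www.highcharts.com/demo/'))

    # Select only <script> elements without a src attribute (inline scripts)
    inline_script = doc.xpath('//script[not(@src)]')
    inline_script.each do |script|
      puts "-"*50, script.text
    end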
    

    Now you just need to find the script block you want and extract the data from it (using a regex). Without more details, it's hard to guess which data you want or which parts of the page structure you can rely on.

    Here's a fairly fragile regex that finds what I'm guessing you were looking for:

    inline = doc.xpath('//script[not(@src)]').map(&:text)
    data   = inline.map{ |js| js[/new Highcharts\.Chart\((.+?\})\);/m,1] }.compact[0]
    puts data
    

    Here's what you get out:

    {
      chart: {
        renderTo: 'container',
        defaultSeriesType: 'line',
        marginRight: 130,
        marginBottom: 25
      },
      title: {
        text: 'Monthly Average Temperature',
        x: -20 //center
      },
      subtitle: {
        text: 'Source: WorldClimate.com',
        x: -20
      },
      xAxis: {
        categories: ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
      },
      yAxis: {
        title: {
          text: 'Temperature (°C)'
        },
        plotLines: [{
          value: 0,
          width: 1,
          color: '#808080'
        }]
      },
      tooltip: {
        formatter: function() {
                    return '<b>'+ this.series.name +'</b><br/>'+
            this.x +': '+ this.y +'°C';
        }
      },
      legend: {
        layout: 'vertical',
        align: 'right',
        verticalAlign: 'top',
        x: -10,
        y: 100,
        borderWidth: 0
      },
      series: [{
        name: 'Tokyo',
        data: [7.0, 6.9, 9.5, 14.5, 18.2, 21.5, 25.2, 26.5, 23.3, 18.3, 13.9, 9.6]
      }, {
        name: 'New York',
        data: [-0.2, 0.8, 5.7, 11.3, 17.0, 22.0, 24.8, 24.1, 20.1, 14.1, 8.6, 2.5]
      }, {
        name: 'Berlin',
        data: [-0.9, 0.6, 3.5, 8.4, 13.5, 17.0, 18.6, 17.9, 14.3, 9.0, 3.9, 1.0]
      }, {
        name: 'London',
        data: [3.9, 4.2, 5.7, 8.5, 11.9, 15.2, 17.0, 16.6, 14.2, 10.3, 6.6, 4.8]
      }]
    }
    

    Note that this is not JSON; it is a string of JavaScript source containing object, string, array, numeric, and function literals, so a JSON parser will reject it as a whole.
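
If all you need is the numeric series, one option is to pull out only the pieces that happen to be valid JSON, such as the `data: [...]` arrays. The sketch below is just that, a sketch: it assumes the extracted text keeps the `name: '...'` followed by `data: [...]` layout shown above, `js_config` is a shortened, hypothetical stand-in for the `data` string, and the pattern is every bit as fragile as the earlier one.

```ruby
require 'json'

# Hypothetical, shortened stand-in for the string captured by the regex above.
js_config = <<~JS
  {
    series: [{
      name: 'Tokyo',
      data: [7.0, 6.9, 9.5]
    }, {
      name: 'New York',
      data: [-0.2, 0.8, 5.7]
    }]
  }
JS

# Capture each series name together with the array literal that follows it.
# Assumption: names are single-quoted and the arrays are not nested.
pairs = js_config.scan(/name:\s*'([^']*)',\s*data:\s*(\[[^\]]*\])/)

# The array literals contain only numbers, so JSON.parse can read them
# even though the surrounding configuration is not JSON.
series = pairs.map { |name, array| [name, JSON.parse(array)] }.to_h

p series
```

Anything beyond simple literals like these (the `formatter` function, for instance) would need a real JavaScript parser or interpreter rather than a regex.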