Tags: ruby, web-scraping, nokogiri, mechanize

How to scrape <script> tags with Nokogiri and Mechanize


I am attempting to scrape information from "St. Paul The Apostle Details Page". I need the address, phone number, and description. All of this information is accessible through normal HTML tags that can be scraped using Nokogiri; however, I also found the same information in a <script> tag:

<script type="application/ld+json">
          {
          "@context": "http://schema.org",
          "@type": "LocalBusiness",
          "address": {
          "@type":"PostalAddress",
          "streetAddress":"98-16 55th Avenue",
          "addressLocality":"Corona",
          "addressRegion":"NY",
          "postalCode": "11368"             
          },
          "name": "St. Paul The Apostle",
          "telephone":"(718) 271-1100",
          "image": "https://www.foodpantries.org/gallery/3101_st._paul_the_apostle_11368_idu.png",
          "description": "<b>Food Pantry Hours: </b><br>2nd and 4th week of the month <br>8:00am and open until food runs out <br>(usually people line up about 1 hour prior to 8 AM)<br><br><b>For more information, please call. </b><br>"
          }
        </script>

I was hoping to use this block of code to scrape all of the info I needed:

def self.scrape_info
  agent = Mechanize.new
  page = agent.get('https://www.foodpantries.org/li/st._paul_the_apostle_11368')
  street_address = agent.page.search('script').text
  puts street_address.to_s
end

How can I do this?


Solution

  • Mechanize is overkill if all you are using it for is to retrieve a page. Many HTTP client gems will easily do that, or you can use OpenURI, which is part of Ruby's standard library.

    These are the basics of retrieving the information. You'll need to figure out which particular script you want, but Nokogiri's tutorials cover the fundamentals:

    require 'json'
    require 'nokogiri'
    require 'open-uri'
    
    # Use URI.open: on Ruby 3.0+ a bare Kernel#open no longer fetches URLs.
    doc = Nokogiri::HTML(URI.open('https://www.foodpantries.org/li/st._paul_the_apostle_11368'))
    

    At this point Nokogiri has a DOM created of the page in memory.

    Find the <script> node you want, and extract the text of the node:

    js = doc.at('script[type="application/ld+json"]').text
    

    at and search are the workhorses for parsing a page: at returns the first matching node, and search returns every match. There are CSS- and XPath-specific variants (at_css, at_xpath, css, xpath), but generally you can use the generic versions and Nokogiri will figure out which to use. All of them are covered in Nokogiri's documentation and tutorials.

    The JSON module offers a shorthand, JSON[...], that either parses a JSON string or generates one from a Ruby object, depending on the argument. Here it parses the string back into a Ruby object, which in this instance is a hash:

    JSON[js]
    # => {"@context"=>"https://schema.org",
    #     "@type"=>"Organization",
    #     "url"=>"https://www.foodpantries.org/",
    #     "sameAs"=>[],
    #     "contactPoint"=>
    #      [{"@type"=>"ContactPoint",
    #        "contactType"=>"customer service",
    #        "url"=>"https://www.foodpantries.org/ar/about",
    #        "email"=>"[email protected]"}]}
    

    Accessing a particular key/value pair is simple, just as with any other hash:

    foo = JSON[js]
    foo['url'] # => "https://www.foodpantries.org/"
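
The same approach applies to the nested fields the question asks about. As a sketch, assuming the LocalBusiness block quoted in the question has already been extracted into a string (the description is shortened here, and the variable names are illustrative), Hash#dig reaches into the nested address hash:

```ruby
require 'json'

# Sample data: the LocalBusiness block from the question, description shortened.
js = <<~'LD_JSON'
  {
    "@context": "http://schema.org",
    "@type": "LocalBusiness",
    "address": {
      "@type": "PostalAddress",
      "streetAddress": "98-16 55th Avenue",
      "addressLocality": "Corona",
      "addressRegion": "NY",
      "postalCode": "11368"
    },
    "name": "St. Paul The Apostle",
    "telephone": "(718) 271-1100",
    "description": "<b>Food Pantry Hours: </b><br>2nd and 4th week of the month"
  }
LD_JSON

info = JSON.parse(js)

# dig walks nested keys and returns nil if any level is missing.
street      = info.dig('address', 'streetAddress')
city        = info.dig('address', 'addressLocality')
phone       = info['telephone']
description = info['description']

puts "#{street}, #{city}"
puts phone
```

Note that the description value still contains HTML markup (<b>, <br>), so it would need further cleanup before being displayed as plain text.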
    

    The page you're referring to has multiple scripts that match the selector I used, so at returns the first one (the Organization block shown above) rather than the LocalBusiness block you're after. You'll want to filter using a more exact selector, or iterate over the matches and pick the one you want. How to do that with CSS or XPath is well documented here on SO and in Nokogiri's documentation.
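
One way to do the "iterate and pick" step is to parse each candidate block and keep the one whose "@type" matches. The sketch below is only an illustration: find_ld_json is a hypothetical helper name, and the two sample strings are trimmed stand-ins for the page's script contents. With Nokogiri, the input array would presumably come from doc.search('script[type="application/ld+json"]').map(&:text).

```ruby
require 'json'

# Hypothetical helper: returns the first parsed JSON-LD hash whose "@type"
# equals the requested type, or nil if none matches.
def find_ld_json(json_texts, type)
  json_texts.each do |text|
    begin
      data = JSON.parse(text)
    rescue JSON::ParserError
      next # skip blocks that are not valid JSON
    end
    return data if data.is_a?(Hash) && data['@type'] == type
  end
  nil
end

# Stand-ins for the text of each matching <script> node on the page.
texts = [
  '{"@type": "Organization", "url": "https://www.foodpantries.org/"}',
  '{"@type": "LocalBusiness", "name": "St. Paul The Apostle", "telephone": "(718) 271-1100"}'
]

business = find_ld_json(texts, 'LocalBusiness')
puts business['name']
puts business['telephone']
```

Filtering on the parsed "@type" rather than on the position of the script in the page should keep working if the site reorders or adds JSON-LD blocks.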