Tags: ruby, regex, url, web-scraping, hpricot

Scrape URLs From Web


<a href="http://www.utoronto.ca/gdrs/" title="Rehabilitation Science"> Rehabilitation Science</a>

For the example above, I want to get the department name "Rehabilitation Science" and its homepage URL "http://www.utoronto.ca/gdrs/" at the same time.

Could someone please suggest some smart regular expressions that would do the job for me?


Solution

  • There's no reason to use regex to do this at all. Here's a solution using Nokogiri, which is the usual Ruby HTML/XML parser:

    html = <<EOT
    <p><a href="http://www.example.com/foo">foo</a></p>
    <p><a href='http://www.example.com/foo1'>foo1</p></a>
    <p><a href=http://www.example.com/foo2>foo2</a></p>
    <p><a href = http://www.example.com/bar>bar</p>
    <p><a 
      href="http://www.example.com/foobar"
      >foobar</a></p>
      <p><a 
        href="http://www.example.com/foobar2"
        >foobar2</p>
    EOT
    
    require 'nokogiri'
    
    doc = Nokogiri::HTML(html)
    
    links = Hash[
      *doc.search('a').map { |a| 
          [
            a['href'],
            a.content
          ]
        }.flatten
      ]
    
    require 'pp'
    pp links
    # >> {"http://www.example.com/foo"=>"foo",
    # >>  "http://www.example.com/foo1"=>"foo1",
    # >>  "http://www.example.com/foo2"=>"foo2",
    # >>  "http://www.example.com/bar"=>"bar",
    # >>  "http://www.example.com/foobar"=>"foobar",
    # >>  "http://www.example.com/foobar2"=>"foobar2"}
    

    This returns a hash of URLs as keys with the related content of the <a> tag as the value. That means you'll only capture unique URLs, throwing away duplicates. If you want all URLs use:

    links = doc.search('a').map { |a| 
        [
          a['href'],
          a.content
        ]
      }
    

    which results in:

    # >> [["http://www.example.com/foo", "foo"],
    # >>  ["http://www.example.com/foo1", "foo1"],
    # >>  ["http://www.example.com/foo2", "foo2"],
    # >>  ["http://www.example.com/bar", "bar"],
    # >>  ["http://www.example.com/foobar", "foobar"],
    # >>  ["http://www.example.com/foobar2", "foobar2"]]
    

I used the CSS selector 'a' to locate the tags. I could use 'a[href]' if I wanted to grab only anchors that actually carry an href attribute, ignoring named anchors.

Regexes are very fragile when dealing with HTML and XML because those markup formats are too free-form: a document can vary in its layout while remaining valid, and HTML in particular varies wildly in its "correctness". If you don't control the generation of the file being parsed, your regex-based code is at the mercy of whoever does generate it; a simple change in the file can break the pattern badly, resulting in a continual maintenance headache.

A parser, because it actually understands the internal structure of the file, can withstand those changes. Notice that I deliberately created some malformed HTML, but the code didn't care. Compare the simplicity of the parser version with a regex solution, and think about long-term maintainability.