ruby, uri, net-http

Scanning a webpage for URLs with Ruby and regex


I'm trying to create an array of all the links found at the URL below. Using page.scan(URI.regexp) or URI.extract(page) returns more than just URLs.

How do I get just the URLs?

require 'net/http'
require 'uri'

uri = URI("https://gist.github.com/JsWatt/59f4b8ce6bbf0c7e4dc7")
page = Net::HTTP.get(uri)
p page.scan(URI.regexp)
p URI.extract(page)
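
Both URI.regexp and URI.extract match any token that looks like a URI of any scheme, not just web links, which is part of why the output is noisy. As a side note, URI.extract accepts an optional list of schemes, so one regex-based way to narrow the matches, assuming only http/https links are wanted, could be:

require 'net/http'
require 'uri'

uri  = URI("https://gist.github.com/JsWatt/59f4b8ce6bbf0c7e4dc7")
page = Net::HTTP.get(uri)

# Keep only tokens whose scheme is http or https; this is still regex-based,
# so it matches URLs anywhere in the text, not only inside href="..." attributes.
p URI.extract(page, %w[http https]).uniq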

Solution

  • If you are just trying to extract links (<a href="..."> elements) from the text file, then it seems better to parse it as real HTML with Nokogiri and then extract the links this way:

    require 'nokogiri'
    require 'open-uri'
    
    # Fetch and parse the raw HTML text
    # (URI.open comes from open-uri; plain open() no longer accepts URLs on Ruby 3+)
    doc = Nokogiri.parse(URI.open('https://gist.githubusercontent.com/JsWatt/59f4b8ce6bbf0c7e4dc7/raw/c340b3fbcab7923e52e5b50165432b6e5f2e3cf4/for_scraper.txt'))
    
    # Extract all a-elements (HTML links)
    all_links = doc.css('a')
    
    # Pull out each href, drop empty ones, then de-duplicate and sort
    links = all_links.map { |link| link.attribute('href').to_s }.
            reject(&:empty?).uniq.sort
    
    # Print out some of them
    puts links.grep(/store/)
    
which prints links like:

http://store.steampowered.com/app/214590/
http://store.steampowered.com/app/218090/
http://store.steampowered.com/app/220780/
http://store.steampowered.com/app/226720/
...
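
The hrefs collected this way can still contain relative paths, fragment-only links, or non-http schemes. If only absolute http/https URLs are wanted, one possible follow-up (a sketch building on the links array above, not part of the original answer) is to filter by scheme:

    require 'uri'

    # Keep only hrefs that parse as absolute http/https URLs
    http_links = links.select do |href|
      begin
        %w[http https].include?(URI.parse(href).scheme)
      rescue URI::InvalidURIError
        false
      end
    end

    puts http_links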