ruby-on-rails ruby open-uri

How to check the correct URL protocol in Ruby?


I have a list of 50,000 websites and I want to know which protocol each of them uses. Every entry is a bare domain such as names.com or something.com; none of them includes a scheme, as in http://google.com. I tried looping over the list and checking each one manually, like this:

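require 'rubygems'

require 'open-uri'
require 'io/console'
require 'open_uri_redirections'
require 'openssl'

# Turns off certificate verification globally by overwriting the constant
# (insecure, and Ruby warns about the reassignment).
OpenSSL::SSL::VERIFY_PEER = OpenSSL::SSL::VERIFY_NONE

filename = "./testfile.txt"
destination = File.open("./11aa.txt", "a")

# One entry per line; each entry keeps its trailing newline.
newArray = IO.readlines(filename)
newArray.each do |url|
  begin
    puts url
    # Record the entry if it can be opened within the read timeout.
    if open(url, :read_timeout => 2)
      destination.write url
    end
  rescue => e
    puts e.message
  end
end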

This works, but it takes forever to finish. I am looking for a better algorithm for the check.

Thanks


Solution

  • "Protocol"? As in the protocol (the URL's scheme) used to connect to a host?

    require 'uri'
    
    URI.parse('http://foo.com').scheme # => "http"
    URI.parse('https://foo.com').scheme # => "https"
    URI.parse('ftp://foo.com').scheme # => "ftp"
    URI.parse('scp://foo.com').scheme # => "scp"
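
The entries in the question are bare hostnames, so it may help to see what URI.parse reports when no scheme is present. This is a minimal sketch; the helper name with_scheme and the default of "http" are assumptions made for illustration, not part of the uri library.

```ruby
require 'uri'

# Hypothetical helper: prefix a default scheme when the entry has none.
def with_scheme(entry, default = 'http')
  entry = entry.strip
  URI.parse(entry).scheme ? entry : "#{default}://#{entry}"
end

URI.parse('google.com').scheme   # => nil (the text is parsed as a relative path)
URI.parse('google.com').host     # => nil
with_scheme("google.com\n")      # => "http://google.com"
with_scheme('https://foo.com')   # => "https://foo.com"
```

The sketch assumes plain hostnames. An entry that carries a port, such as foo.com:8080, may be parsed as if the part before the colon were a scheme, so such entries would need separate handling.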
    

    If you want to know whether a site accepts HTTPS vs. HTTP, I'd start by checking for HTTPS, as the majority of sites allow HTTP:

    require 'net/http'
    
    %w[
      example.com
      www.example.com
      mail.google.com
      account.dyn.com
    ].each do |url|
      begin
        Net::HTTP.start(url, 443, :use_ssl => true) {}
        puts "#{url} is HTTPS"
      rescue
        puts "#{url} is HTTP"
      end
    end
    # >> example.com is HTTP
    # >> www.example.com is HTTP
    # >> mail.google.com is HTTPS
    # >> account.dyn.com is HTTPS
    

    Even though mail.google.com and account.dyn.com are HTTPS, if you test them for HTTP first, you'll see they also respond over HTTP. Some sites redirect HTTP requests to their HTTPS server; others run both so the user can decide whether to use HTTP or HTTPS. You can test both protocols to figure out which case applies.
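
Testing both protocols for one host could look roughly like the following. It is a sketch under assumptions: the helper names accepts?, classify and protocols_for, the two-second timeout, and the default ports 80 and 443 are all illustrative choices.

```ruby
require 'net/http'

# Hypothetical helper: true if a connection (plus the TLS handshake when
# use_ssl is set) to host:port succeeds within `timeout` seconds.
def accepts?(host, port, use_ssl, timeout = 2)
  Net::HTTP.start(host, port, :use_ssl => use_ssl,
                  :open_timeout => timeout, :read_timeout => timeout) {}
  true
rescue StandardError
  false
end

# Turn the two probe results into a label.
def classify(http_ok, https_ok)
  if http_ok && https_ok then 'both'
  elsif https_ok         then 'https only'
  elsif http_ok          then 'http only'
  else                        'unreachable'
  end
end

# Probe the default HTTP and HTTPS ports for one host.
def protocols_for(host)
  classify(accepts?(host, 80, false), accepts?(host, 443, true))
end
```

Calling protocols_for('mail.google.com') returns one of the four labels. The short timeouts matter for a 50,000-entry list, because an unreachable host otherwise blocks for the default connect timeout.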

    start doesn't require a block, but if you provide an empty one, the connection is closed automatically right after it is established.

    Sites don't necessarily run their web services on ports 80 and 443. As a result, assuming the connection should be to one of those ports isn't necessarily right and could give you bad results if they use a different one. 8080 and 8081 are also often used so those should be checked too.
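
A quick way to see which of several ports accept a connection at all is a plain TCP probe. This is only a sketch: the candidate list and the helper names port_open? and open_ports are assumptions, and an open port says nothing yet about whether it speaks HTTP or HTTPS.

```ruby
require 'socket'

# Assumed candidate list: the two default ports plus the alternates named above.
CANDIDATE_PORTS = [80, 443, 8080, 8081].freeze

# True if a plain TCP connection to host:port opens within `timeout` seconds.
def port_open?(host, port, timeout = 2)
  Socket.tcp(host, port, connect_timeout: timeout) { true }
rescue StandardError
  false
end

# Returns the subset of `ports` that accepted a connection.
def open_ports(host, ports = CANDIDATE_PORTS)
  ports.select { |port| port_open?(host, port) }
end
```

open_ports('example.com') would return an array of the ports that answered. Each of those could then be tried with and without :use_ssl to find out which protocol it serves.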

    Also, a site might respond on a port, but its content could be a redirect pointing you to the real port it wants you to use. So you need to consider whether you only care about the connection succeeding, or whether you should look inside the HTTP headers, or actually read the entire page returned and parse it in case it's a software redirect.
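
Looking at the headers for a redirect could be sketched like this. The helper names redirect_target and first_hop are hypothetical, and the code only inspects the HTTP Location header; it does not detect software redirects embedded in the page body.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: if `response` is an HTTP redirect, return the absolute
# URI it points to (resolved against `base_url`); otherwise return nil.
def redirect_target(response, base_url)
  return nil unless response.is_a?(Net::HTTPRedirection)
  location = response['location']
  location && URI.join(base_url, location)
end

# Issue a single HEAD request and report where the server sends us, if anywhere.
def first_hop(url, timeout = 2)
  uri = URI.parse(url)
  response = Net::HTTP.start(uri.host, uri.port,
                             :use_ssl => uri.scheme == 'https',
                             :open_timeout => timeout,
                             :read_timeout => timeout) do |http|
    http.head(uri.request_uri)
  end
  redirect_target(response, url)
end
```

If first_hop('http://foo.com') returns a URI, its scheme and port show what the site wants you to use; nil means the server answered without redirecting.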

    In other words, a connection succeeding doesn't tell you enough about what the site wants you to use; you'll have to conduct additional tests too.
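
Since the question involves 50,000 hosts and every one of these tests waits on the network, one option for speeding things up is to run the checks on several threads at once. The following is a sketch, not a tuned solution: the name check_all and the pool size of 20 are assumptions, and the block stands for whatever per-host test you settle on.

```ruby
# Hypothetical helper: call the block for every host on `workers` threads and
# return a hash of host => block result.
def check_all(hosts, workers = 20, &check)
  queue = Queue.new
  hosts.each { |host| queue << host }
  queue.close                        # pop returns nil once the queue is drained
  results = {}
  lock = Mutex.new

  threads = Array.new(workers) do
    Thread.new do
      while (host = queue.pop)
        outcome = begin
          check.call(host)
        rescue StandardError => e
          e                          # keep the error as that host's result
        end
        lock.synchronize { results[host] = outcome }
      end
    end
  end
  threads.each(&:join)
  results
end
```

Threads suit this job because the work is almost entirely waiting on sockets. The per-host check should still use short timeouts, or a few dead hosts will tie up the whole pool.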