
Find a URL in a document using regex in Ruby


I have been trying to find a URL in an HTML document. This has to be done with a regex, since the URL is not inside any HTML tag, so I can't use Nokogiri for it. To get the HTML I used HTTParty, like this:

require 'httparty'
doc = HTTParty.get("http://127.0.0.1:4040")
puts doc

That outputs the HTML code. To get the URL, I used the .split() method to cut the string down to the URL. The full code is:

require 'httparty'

doc = HTTParty.get('http://127.0.0.1:4040').split(".ngrok.io")[0].split('https:')[2]

puts "https:#{doc}.ngrok.io"

I want to do this with a regex instead, since ngrok might change its localhost HTML page, and then the split-based code would stop working. How do I do it?


Solution

  • If I understood correctly, you want to find all URLs matching "https://(any subdomain).ngrok.io", right?

    If so, you can use String#scan with a regexp. Here is an example:

    # get your body (replace with your HTTP request)
    body = "my doc contains https://subdomain.ngrok.io and https://subdomain-1.subdomain.ngrok.io"
    puts body
    
    # Use scan and you're done
    urls = body.scan(%r{https://[0-9A-Za-z.-]+\.ngrok\.io})
    puts urls
    

    This results in an array containing ["https://subdomain.ngrok.io", "https://subdomain-1.subdomain.ngrok.io"].

    Call .uniq on it if you want to get rid of duplicates.

    This doesn't handle all edge cases (for example, it ignores plain http:// URLs and any host that doesn't end in .ngrok.io), but it's probably enough for what you need.
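
To tie the scan and .uniq steps together, here is a minimal sketch. The constant NGROK_URL and the helper ngrok_urls are hypothetical names introduced only for illustration, and the sample string is made up; with HTTParty you would presumably pass in the response body instead, as noted in the comment.

```ruby
# Hypothetical names (NGROK_URL, ngrok_urls) used for illustration only.
# Matches https URLs whose host ends in .ngrok.io.
NGROK_URL = %r{https://[0-9A-Za-z.-]+\.ngrok\.io}

# Returns every distinct ngrok URL found in the given text.
def ngrok_urls(text)
  text.scan(NGROK_URL).uniq
end

# Made-up sample text. With HTTParty this would be something like:
#   text = HTTParty.get('http://127.0.0.1:4040').body
text = 'tunnel https://abc123.ngrok.io -> localhost:80, ' \
       'again https://abc123.ngrok.io and http://abc123.ngrok.io'

puts ngrok_urls(text)
# prints: https://abc123.ngrok.io
```

If you only expect a single tunnel, `ngrok_urls(text).first` gives you that URL, or nil when nothing matched.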