I am trying to scrape all the email addresses on a given site using a single-file Ruby script. At the bottom of the file I have a hardcoded test case using a URL that has an email address listed on that specific page (so it should find an email address on the first iteration of the first loop).
For some reason, my regex does not seem to be matching:
#get_emails.rb
require 'rubygems'
require 'open-uri'
require 'nokogiri'
require 'mechanize'
require 'uri'
require 'anemone'
class GetEmails
  def initialize
    @urlCounter, @anemoneCounter = 0
    $allUrls, $emailUrls, $emails = []
  end
  def has_email?(listingUrl)
    hasListing = false
    Anemone.crawl(listingUrl) do |anemone|
      anemone.on_every_page do |page|
        body_text = page.body.to_s
        matchOrNil = body_text.match(/\A[^@\s]+@[^@\s]+\z/)
        if matchOrNil != nil
          $emailUrls[$anemoneCounter] = listingUrl
          $emails[$anemoneCounter] = body_text.match
          $anemoneCounter += 1
          hasListing = true
        else
        end
      end
    end
    return hasListing
  end
end
emailGrab = GetEmails.new()
emailGrab.has_email?("http://genuinestoragesheds.com/contact/")
puts $emails[0]
\A and \z in your regex match the beginning and end of the string, respectively, so the pattern only succeeds when the entire string is an email address. Obviously that web page contains more than just an email string, or you wouldn't do the regex test at all.
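A minimal sketch of that behavior, assuming a made-up page body and address (page_body and info@example.com are placeholders, not taken from the real site): the anchored pattern from the question fails on a string that contains anything besides the address, and succeeds only when the address is the whole string.

```ruby
# Anchored pattern from the question: \A and \z force the match to span
# the entire string, from its first character to its last.
anchored = /\A[^@\s]+@[^@\s]+\z/

# Hypothetical stand-in for page.body; a real page has far more markup.
page_body = "<p>Contact us at info@example.com today</p>"

# The body contains whitespace and extra text, so the anchored match fails.
anchored_on_page = anchored.match(page_body)

# The same pattern succeeds when the string is nothing but the address.
anchored_on_email = anchored.match("info@example.com")

puts anchored_on_page.inspect
puts anchored_on_email.inspect
```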
You can simplify it to just /[^@\s]+@[^@\s]+/, but you would still need to clean up the matched string to extract the email.
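Here is one way the cleanup could look, as a rough sketch rather than a complete solution. The page body and addresses are invented for illustration, and the stricter pattern is my own assumption: it is a common approximation, not a full RFC 5322 validator, and it may miss or over-match unusual addresses.

```ruby
# Unanchored pattern from the answer: finds an address anywhere in the text,
# but also swallows any adjacent non-whitespace characters such as HTML tags.
loose = /[^@\s]+@[^@\s]+/

# Assumed stricter pattern: limits both sides of the @ to characters that
# commonly appear in addresses, so surrounding markup is left out.
strict = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/

# Hypothetical stand-in for page.body.
page_body = '<p>Email:<b>info@example.com</b> or ' \
            '<a href="mailto:sales@example.com">sales</a></p>'

# First loose match: still wrapped in markup, so it needs cleaning up.
loose_hit = page_body[loose]

# String#scan returns every non-overlapping match; uniq drops repeats.
emails = page_body.scan(strict).uniq

puts loose_hit
puts emails.inspect
```

Using scan instead of match also collects every address on the page in one pass, which fits the goal of gathering all emails rather than only the first.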