I am developing a Rails application that needs to check whether a sitemap exists for a website URL entered by the user. For example, if a user enters http://google.com, it should return "Sitemap present". From what I have read, websites usually expose their sitemap at either /sitemap.xml or /sitemap. So I tried checking for this with the typhoeus gem: I request the URL (e.g. www.google.com/sitemap.xml or www.apple.com/sitemap) and inspect response.code; if it is 200 or 301, I assume the sitemap exists, otherwise not. However, I have found that some sites return a 301 even when they have no sitemap, because they redirect the request to their main page (e.g. http://yournextleap.com/sitemap.xml), so the result is not conclusive. Any help would be appreciated. Here is my sample code to check for a sitemap using typhoeus:
require "typhoeus"

# The request object
request = Typhoeus::Request.new("http://apple.com/sitemap")

# Inspect the status code once the request completes.
request.on_complete do |response|
  if response.code == 301
    puts "Success 301"
  elsif response.code == 200
    puts "Success 200"
  elsif response.code == 404
    puts "Could not get a sitemap, something's wrong."
  else
    puts "Check your input!"
  end
end

# Run the request via Hydra.
hydra = Typhoeus::Hydra.new
hydra.queue(request)
hydra.run
The HTTP response status code 301 Moved Permanently is used for permanent redirection. This status code should be used with the Location header. RFC 2616 states that:
If a client has link-editing capabilities, it should update all references to the Request URI. The response is cacheable. Unless the request method was HEAD, the entity should contain a small hypertext note with a hyperlink to the new URI(s). If the 301 status code is received in response to a request of any type other than GET or HEAD, the client must ask the user before redirecting.
I don't think it's fair to assume that a 301 response indicates there was ever a sitemap. If you're checking for the existence of a sitemap.xml file or a sitemap directory, then the correct response to expect is a 2XX.
If you insist on assuming that a 3XX response indicates a redirect to a sitemap, then follow the redirect and add logic to check the URL of the final page (to see whether it is the homepage) or the content of the page (to see whether it has XML structure).
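The approach above could be sketched roughly as follows. This is only an illustration, not a definitive implementation: the helper names `looks_like_sitemap?` and `sitemap_present?` are made up for this example, and the check assumes an XML sitemap whose root element is `<urlset>` or `<sitemapindex>` (as in the sitemaps.org protocol), so an HTML page served at /sitemap would be reported as not present. It relies on Typhoeus's `followlocation` option and the response's `effective_url`.

```ruby
require "uri"

# Root elements used by XML sitemaps (sitemaps.org protocol).
SITEMAP_ROOT = /<(urlset|sitemapindex)[\s>]/i

# Hypothetical helper: decides whether the final (post-redirect) response
# looks like an XML sitemap. It does no network I/O, so it is easy to test.
def looks_like_sitemap?(status, final_url, body)
  # Only a 2XX on the final response counts as "found".
  return false unless (200..299).cover?(status)

  # A redirect that ends on the site root is the homepage, not a sitemap.
  path = URI.parse(final_url).path.to_s
  return false if path.empty? || path == "/"

  # The root element appears near the top, so the first few KB are enough.
  body.to_s[0, 4096].match?(SITEMAP_ROOT)
end

# Hypothetical wrapper: fetches the URL with Typhoeus, following redirects,
# then applies the check above to wherever the request ended up.
def sitemap_present?(url)
  require "typhoeus" # loaded lazily so the helper above works without the gem
  response = Typhoeus.get(url, followlocation: true)
  looks_like_sitemap?(response.code, response.effective_url || url, response.body)
end
```

With this in place, something like `puts "Sitemap present" if sitemap_present?("http://apple.com/sitemap.xml")` would replace the status-code branches in the question. Keeping the decision logic separate from the HTTP call means the redirect-to-homepage case can be checked without hitting the network.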