I am struggling with a poorly structured web-server log file, which I want to summarize in order to analyse traffic on the hosted site. Unfortunately for me, the architecture of the site is messy: there is no index of the hosted objects (HTML pages, JPG images, PDF documents, etc.), and several URIs can refer to the same page. For example:
http://www.site.fr/main.asp?page=foo.htm
http://www.site.fr/storage-tree/foo.htm
http://www.site.fr/specific.asp?id=200
http://www.site.fr/specific.asp?path=/storage-tree/foo.htm
and so on, with no obvious regularity linking the duplicate URIs.
How, conceptually and practically, can I efficiently identify the pages? As I see the problem, the idea is to build an index linking the log's URIs to a unique object identifier derived from HTTP requests. There are three loose constraints:
This is pretty easy with httr:
library(httr)

# HEAD follows redirects; $url on the response gives the final, resolved URL
HEAD("http://gmail.com")$url
You will probably also want to check the status_code returned by HEAD, as failed requests often won't be redirected.
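For example, a minimal check might look like this (the URL is taken from the question's examples; treating any status below 400 as a success is my assumption, not part of httr):

r <- HEAD("http://www.site.fr/specific.asp?id=200")
if (status_code(r) < 400) {
  canonical <- r$url  # keep the resolved URL only for successful requests
}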
(One advantage of using httr over RCurl here is that it automatically preserves the connection across multiple HTTP calls to the same site, which makes things quite a bit faster.)
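Putting it together, here is a minimal sketch of the index the question asks for, assuming the parsed log's URIs are available in a character vector log_uris (a placeholder name); the resolve helper and the 5-second timeout are my own choices:

library(httr)

# Hypothetical helper: resolve a URI to its final URL, or NA on failure
resolve <- function(uri) {
  r <- tryCatch(HEAD(uri, timeout(5)), error = function(e) NULL)
  if (is.null(r) || status_code(r) >= 400) return(NA_character_)
  r$url
}

uris  <- unique(log_uris)              # distinct URIs taken from the log
index <- vapply(uris, resolve, character(1))
# URIs that map to the same resolved URL can be counted as one page

This only identifies duplicates that the server actually redirects to a common URL; URIs that serve the same content directly under different addresses would still need another strategy (e.g. comparing response bodies).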