I have configured Apache Nutch 2.3.1 with the Hadoop ecosystem. I need to fetch some Persian (Arabic-script) websites. Nutch throws an exception for a few URLs at fetch time. The following is an example exception:
java.lang.IllegalArgumentException: Invalid uri 'http://agahi.safirak.com/ads/850/پیچ-بند-بادی-هفتیری-1800-دور-بادی-جیسون.html': escaped absolute path not valid
at org.apache.commons.httpclient.HttpMethodBase.<init>(HttpMethodBase.java:222)
at org.apache.commons.httpclient.methods.GetMethod.<init>(GetMethod.java:89)
at org.apache.nutch.protocol.httpclient.HttpResponse.<init>(HttpResponse.java:77)
at org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:173)
at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:245)
at org.apache.nutch.fetcher.FetcherReducer$FetcherThread.run(FetcherReducer.java:564)
I've been able to reproduce this issue even on the 1.x branch. The problem is that the URI class the Apache HTTP client library uses internally doesn't support unescaped UTF-8 characters.
From the Javadoc for java.net.URI:

Character categories

RFC 2396 specifies precisely which characters are permitted in the various components of a URI reference. The following categories, most of which are taken from that specification, are used below to describe these constraints:
- alpha: The US-ASCII alphabetic characters, 'A' through 'Z' and 'a' through 'z'
- digit: The US-ASCII decimal digit characters, '0' through '9'
- alphanum: All alpha and digit characters
- unreserved: All alphanum characters together with those in the string "_-!.~'()*"
- punct: The characters in the string ",;:$&+="
- reserved: All punct characters together with those in the string "?/[]@"
- escaped: Escaped octets, that is, triplets consisting of the percent character ('%') followed by two hexadecimal digits ('0'-'9', 'A'-'F', and 'a'-'f')
- other: The Unicode characters that are not in the US-ASCII character set, are not control characters (according to the Character.isISOControl method), and are not space characters (according to the Character.isSpaceChar method) (deviation from RFC 2396, which is limited to US-ASCII)

The set of all legal URI characters consists of the unreserved, reserved, escaped, and other characters.
Properly escaped, the URL would have the non-ASCII characters in its path percent-encoded as UTF-8 octets.
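For reference, a minimal sketch of producing the escaped form with the JDK alone (the class name EscapeDemo is mine, not part of Nutch): the multi-argument java.net.URI constructor quotes illegal characters per component, and toASCIIString() then percent-encodes the remaining non-ASCII characters as UTF-8 octets.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class EscapeDemo {
    public static void main(String[] args) throws URISyntaxException {
        // Build the URI from its components; the constructor quotes
        // characters that are illegal in the path component.
        URI uri = new URI("http", "agahi.safirak.com",
                "/ads/850/پیچ-بند-بادی-هفتیری-1800-دور-بادی-جیسون.html", null);
        // toASCIIString() percent-encodes the Persian characters
        // as UTF-8 octets (e.g. پ becomes %D9%BE).
        System.out.println(uri.toASCIIString());
    }
}
```

The escaped result is what stricter URI parsers, such as the one inside Apache Commons HttpClient, expect to receive.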
In fact, if you open the example URL in Chrome and then copy it from the address bar, you'll get the escaped representation. Feel free to open an issue for this (otherwise I'll do it). In the meantime you could try the protocol-http
plugin, which does not use the Apache HTTP client. I've tested locally and the parsechecker works fine:
➜ local (master) ✗ bin/nutch parsechecker "http://agahi.safirak.com/ads/850/پیچ-بند-بادی-هفتیری-1800-دور-بادی-جیسون.html"
fetching: http://agahi.safirak.com/ads/850/پیچ-بند-بادی-هفتیری-1800-دور-بادی-جیسون.html
robots.txt whitelist not configured.
parsing: http://agahi.safirak.com/ads/850/پیچ-بند-بادی-هفتیری-1800-دور-بادی-جیسون.html
contentType: text/html
signature: 048b390ab07464f5d61ae09646253529
---------
Url
---------------
http://agahi.safirak.com/ads/850/پیچ-بند-بادی-هفتیری-1800-دور-بادی-جیسون.html
---------
ParseData
---------
Version: 5
Status: success(1,0)
Title: پیچ بند بادی هفتیری 1800 دور بادی جیسون-نیازمندی سفیرک
Outlinks: 76
outlink: toUrl: http://agahi.safirak.com/ads/850/پیچ-بند-بادی-هفتیری-1800-دور-بادی-جیسون.html anchor:
outlink: toUrl: http://agahi.safirak.com/assets/fonts/font-awesome/css/font-awesome.min.css anchor:
outlink: toUrl: http://agahi.safirak.com/assets/css/bootstrap.css anchor:
...
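Switching plugins is done in conf/nutch-site.xml; a sketch, assuming an otherwise default plugin list (the exact plugin.includes value below is illustrative, so adjust it to keep whatever other plugins your crawl already uses):

```xml
<!-- conf/nutch-site.xml: use protocol-http instead of protocol-httpclient.
     The plugin list here is illustrative; match it to your configuration. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
</property>
```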