I am trying to crawl with Nutch 1.17, but the URL is being rejected. The URL contains #!, for example: !/xxx/abc.html
I have also tried including
+^#! in my regex-urlfilter.
As part of URL normalization, the following rule truncates the URL at the fragment, so everything from the # onward is dropped:
<!-- removes interpage href anchors such as -->
You can disable this rule by commenting it out (the recommended way), or you can remove urlnormalizer-regex from the plugin.includes property in nutch-site.xml.
BasicURLNormalizer applies general normalization to URLs (i.e. removing duplicate consecutive slashes and properly percent-encoding characters):
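To illustrate what a fragment-stripping rule does to a #! URL, here is a minimal sketch. The pattern `#.*$` is an assumption chosen for illustration, not necessarily the exact pattern shipped with Nutch, and the example.com URL is hypothetical.

```java
import java.util.regex.Pattern;

final class FragmentRuleSketch {
    // Assumed pattern: everything from the first '#' to the end of the URL.
    // The actual rule in the Nutch regex normalizer configuration may differ.
    private static final Pattern FRAGMENT = Pattern.compile("#.*$");

    // Returns the URL with the fragment (if any) removed.
    static String stripFragment(String url) {
        return FRAGMENT.matcher(url).replaceFirst("");
    }

    public static void main(String[] args) {
        // Hypothetical URL, not the one from the question.
        System.out.println(stripFragment("https://example.com/page#!/xxx/abc.html"));
    }
}
```

With a rule like this active, every #! URL on the same page collapses to the same normalized URL, which is why the fragment part never reaches the fetcher.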
    public String normalize(String urlString, String scope)
        throws MalformedURLException {

      if ("".equals(urlString)) // permit empty
        return urlString;

      urlString = urlString.trim(); // remove extra spaces

      URL url = new URL(urlString);

      String protocol = url.getProtocol();
      String host = url.getHost();
      int port = url.getPort();
      String file = url.getFile();

      boolean changed = false;
      boolean normalizePath = false;

      if (!urlString.startsWith(protocol)) // protocol was lowercased
        changed = true;

      if ("http".equals(protocol) || "https".equals(protocol)
          || "ftp".equals(protocol)) {

        if (host != null && url.getAuthority() != null) {
          String newHost = normalizeHostName(host);
          if (!host.equals(newHost)) {
            host = newHost;
            changed = true;
          } else if (!url.getAuthority().equals(newHost)) {
            // authority (http://<...>/) contains other elements (port, user,
            // etc.) which will likely cause a change if left away
            changed = true;
          }
        } else {
          // no host or authority: recompose the URL from components
          changed = true;
        }

        if (port == url.getDefaultPort()) { // uses default port
          port = -1; // so don't specify it
          changed = true;
        }

        normalizePath = true;
        if (file == null || "".equals(file)) {
          file = "/";
          changed = true;
          normalizePath = false; // no further path normalization required
        } else if (!file.startsWith("/")) {
          file = "/" + file;
          changed = true;
          normalizePath = false; // no further path normalization required
        }

        if (url.getRef() != null) { // remove the ref
          changed = true;
        }

      } else if (protocol.equals("file")) {
        normalizePath = true;
      }

      // properly encode characters in path/file using percent-encoding
      String file2 = unescapePath(file);
      file2 = escapePath(file2);
      if (!file.equals(file2)) {
        changed = true;
        file = file2;
      }

      if (normalizePath) {
        // check for unnecessary use of "/../", "/./", and "//"
        if (changed) {
          url = new URL(protocol, host, port, file);
        }
        file2 = getFileWithNormalizedPath(url);
        if (!file.equals(file2)) {
          changed = true;
          file = file2;
        }
      }

      if (changed) {
        url = new URL(protocol, host, port, file);
        urlString = url.toString();
      }

      return urlString;
    }
As you can see from the code, it completely ignores **url.getRef**,
which holds the URL fragment.
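The reason the fragment disappears can be seen from how java.net.URL splits a URL: getFile() returns only the path (plus query), while the fragment is available separately through getRef(). The following is a small sketch of that split; the class name, helper names, and the example.com URL are hypothetical and only serve the illustration.

```java
import java.net.MalformedURLException;
import java.net.URL;

final class UrlPartsDemo {
    // Parses the string, wrapping the checked exception for brevity.
    static URL parse(String s) {
        try {
            return new URL(s);
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Path plus query; never includes the fragment.
    static String fileOf(String s) {
        return parse(s).getFile();
    }

    // The fragment after '#', or null if the URL has none.
    static String refOf(String s) {
        return parse(s).getRef();
    }

    public static void main(String[] args) {
        String u = "https://example.com/dir/#!/xxx/abc.html"; // hypothetical URL
        System.out.println("file = " + fileOf(u));
        System.out.println("ref  = " + refOf(u));
    }
}
```

Any URL rebuilt from protocol, host, port, and file alone therefore cannot contain the fragment.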
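So one option is to replace

    url = new URL(protocol, host, port, file);

at the end of the normalize method with

    url = new URL(protocol, host, port, file + "#" + url.getRef());

Note that url.getRef() returns null when the URL has no fragment, so the fragment should only be appended when it is present; otherwise the string "#null" ends up in the URL.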
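A slightly more defensive variant of this replacement could look like the sketch below. It is a standalone illustration, not a patch against the Nutch source: the class and method names are hypothetical, and the example URLs are made up. It captures the fragment before the URL object is rebuilt (inside normalize the url variable may already have been reassigned without the fragment during path normalization) and appends it only when it is not null.

```java
import java.net.MalformedURLException;
import java.net.URL;

final class KeepFragmentSketch {
    // Rebuilds the URL from its components while keeping the fragment.
    static String rebuild(String urlString) {
        try {
            URL url = new URL(urlString.trim());
            String ref = url.getRef();   // capture before the URL is rebuilt
            String file = url.getFile(); // path plus query, without fragment
            if (ref != null) {
                file = file + "#" + ref; // only append when a fragment exists
            }
            return new URL(url.getProtocol(), url.getHost(), url.getPort(), file)
                .toString();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical URLs for illustration.
        System.out.println(rebuild("https://example.com/dir/#!/xxx/abc.html"));
        System.out.println(rebuild("https://example.com/dir/"));
    }
}
```

Whether to keep every fragment or only those starting with "!" is a design choice; keeping all of them would make ordinary in-page anchors look like distinct URLs to the crawler.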
How did I validate this?
scala> val url = new URL("!/AlisoViejo01/AlisoViejo01.html");
url: =!/AlisoViejo01/AlisoViejo01.html
scala> val protocol = url.getProtocol();
protocol: String = https
scala> val host = url.getHost();
host: String =
scala> val port = url.getPort();
port: Int = -1
scala> val file = url.getFile();
file: String = /CA/AlisoViejo/
scala> // when we reconstruct the URL from the components above, we end up losing the fragment, as shown below
scala> new URL(protocol, host, port, file).toString
res69: String =
scala> // if we include url.getRef when constructing the URL, the fragment is retained,
scala> // as shown below
scala> new URL(protocol, host, port, file+"#"+url.getRef).toString
res70: String =!/AlisoViejo01/AlisoViejo01.html
scala> // so we can change the URL construction as explained above to retain the URL fragment
Note: a URL fragment is a reference to a location within the page. In most cases it does not make sense to crawl such URLs separately, because the HTML stays the same; that is why Nutch normalizes URLs with the rule above.