Tags: web-crawler, nutch

Not able to crawl a URL because it contains a special character


I am trying to crawl with Nutch 1.17, but the URL is being rejected because it contains #!. Example: xxmydomain.com/xxx/#!/xxx/abc.html

I have also tried adding the following rules to my regex-urlfilter.txt:

+^/

+^#!


Solution

    1. Check the regex-normalize.xml file. This rule file is applied by the urlnormalizer-regex plugin, which is included by default via plugin.includes in nutch-site.xml.

    As part of URL normalization, the following rule strips the URL fragment (everything from # up to a ?, a &, or the end of the URL):

    <!-- removes interpage href anchors such as site.com#location -->
    <regex>
      <pattern>#.*?(\?|&amp;|$)</pattern>
      <substitution>$1</substitution>
    </regex>
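
    To see the effect of this pattern, here is a small standalone Java check (not using any Nutch classes) that applies the same regex to the URL from the question:

        import java.util.regex.Pattern;

        public class FragmentRuleDemo {
          public static void main(String[] args) {
            // same pattern as in regex-normalize.xml (&amp; is & once XML-decoded)
            Pattern p = Pattern.compile("#.*?(\\?|&|$)");
            String url = "https://xxmydomain.com/xxx/#!/xxx/abc.html";
            // substitute the captured delimiter for the whole match, as the rule does
            String normalized = p.matcher(url).replaceFirst("$1");
            System.out.println(normalized); // https://xxmydomain.com/xxx/
          }
        }

    The entire #!/xxx/abc.html part is removed at normalization time, before the regex-urlfilter rules are applied, so no filter rule such as +^#! can bring it back.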
    

    You can disable this rule by commenting it out (the recommended way, shown below), or you can remove urlnormalizer-regex from the plugin.includes configuration in nutch-site.xml.
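
    For example, the disabled rule in conf/regex-normalize.xml could look like this (the original inline comment is folded into the disabling comment, since XML comments cannot nest):

    <!-- removes interpage href anchors such as site.com#location
         (rule disabled so that URL fragments are kept):
    <regex>
      <pattern>#.*?(\?|&amp;|$)</pattern>
      <substitution>$1</substitution>
    </regex>
    -->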

    2. There is one more place where the URL fragment is dropped during normalization: the urlnormalizer-basic plugin.

    BasicURLNormalizer applies general normalization to URLs (e.g. removing redundant slashes and percent-encoding the path properly):

      public String normalize(String urlString, String scope)
          throws MalformedURLException {
        
        if ("".equals(urlString)) // permit empty
          return urlString;
    
        urlString = urlString.trim(); // remove extra spaces
    
        URL url = new URL(urlString);
    
        String protocol = url.getProtocol();
        String host = url.getHost();
        int port = url.getPort();
        String file = url.getFile();
    
        boolean changed = false;
        boolean normalizePath = false;
    
        if (!urlString.startsWith(protocol)) // protocol was lowercased
          changed = true;
    
        if ("http".equals(protocol) || "https".equals(protocol)
            || "ftp".equals(protocol)) {
    
          if (host != null && url.getAuthority() != null) {
            String newHost = normalizeHostName(host);
            if (!host.equals(newHost)) {
              host = newHost;
              changed = true;
            } else if (!url.getAuthority().equals(newHost)) {
              // authority (http://<...>/) contains other elements (port, user,
              // etc.) which will likely cause a change if left away
              changed = true;
            }
          } else {
            // no host or authority: recompose the URL from components
            changed = true;
          }
    
          if (port == url.getDefaultPort()) { // uses default port
            port = -1; // so don't specify it
            changed = true;
          }
    
          normalizePath = true;
          if (file == null || "".equals(file)) {
            file = "/";
            changed = true;
            normalizePath = false; // no further path normalization required
          } else if (!file.startsWith("/")) {
            file = "/" + file;
            changed = true;
            normalizePath = false; // no further path normalization required
          }
    
          if (url.getRef() != null) { // remove the ref
            changed = true;
          }
    
        } else if (protocol.equals("file")) {
          normalizePath = true;
        }
    
        // properly encode characters in path/file using percent-encoding
        String file2 = unescapePath(file);
        file2 = escapePath(file2);
        if (!file.equals(file2)) {
          changed = true;
          file = file2;
        }
    
        if (normalizePath) {
          // check for unnecessary use of "/../", "/./", and "//"
          if (changed) {
            url = new URL(protocol, host, port, file);
          }
          file2 = getFileWithNormalizedPath(url);
          if (!file.equals(file2)) {
            changed = true;
            file = file2;
          }
        }
    
        if (changed) {
          url = new URL(protocol, host, port, file);
          urlString = url.toString();
        }
    
        return urlString;
      }
    

    You can see from the code that it completely ignores **url.getRef()**, which holds the URL fragment: when a fragment is present the URL is only marked as changed, and it is then rebuilt without the fragment.

    So what we can do is simply replace

    url = new URL(protocol, host, port, file);

    at the end of the normalize method with

    url = new URL(protocol, host, port, file + "#" + url.getRef());
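
    One caveat: changed can be true even when the URL has no fragment (for example, when a default port is dropped), and url.getRef() then returns null, which would produce URLs ending in #null. A minimal sketch of the patched tail of normalize that guards against this:

        if (changed) {
          if (url.getRef() != null) {
            // re-attach the fragment so it survives normalization
            url = new URL(protocol, host, port, file + "#" + url.getRef());
          } else {
            url = new URL(protocol, host, port, file);
          }
          urlString = url.toString();
        }

        return urlString;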

    How did I validate this?

    scala> val url = new URL("https://www.codepublishing.com/CA/AlisoViejo/#!/AlisoViejo01/AlisoViejo01.html");
    url: java.net.URL = https://www.codepublishing.com/CA/AlisoViejo/#!/AlisoViejo01/AlisoViejo01.html
    
    scala> val protocol = url.getProtocol();
    protocol: String = https
    
    scala>     val host = url.getHost();
    host: String = www.codepublishing.com
    
    scala>     val port = url.getPort();
    port: Int = -1
    
    scala>     val file = url.getFile();
    file: String = /CA/AlisoViejo/
    
    scala> // when we construct a new URL from the components above we end up losing the fragment information, as shown below
    
    scala> new URL(protocol, host, port, file).toString
    res69: String = https://www.codepublishing.com/CA/AlisoViejo/
    
    scala> // if we use url.getRef when constructing the URL we can retain the fragment information

    scala> // as shown below
    
    scala> new URL(protocol, host, port, file+"#"+url.getRef).toString
    res70: String = https://www.codepublishing.com/CA/AlisoViejo/#!/AlisoViejo01/AlisoViejo01.html
    
    scala> // so we can change the URL construction as explained above to retain the fragment information
    

    Note: a URL fragment is a local reference within a page. In most cases it does not make sense to crawl such URLs, because the HTML returned by the server is the same with or without the fragment; that is why Nutch normalizes them away with the rules above.