Nutch 1.12 on Cygwin on Windows 7 - NullPointerException


I'm trying to get Nutch running for the first time for a work project. For now, the plan is to run Nutch on a single Windows 7 machine to scrape content from a dozen or so web sites. Below is the command-line output from Cygwin.

$ bin/nutch inject crawl/crawldb urls
Injector: starting at 2016-10-29 09:16:37
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: java.lang.NullPointerException
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:445)
        at org.apache.hadoop.util.Shell.run(Shell.java:418)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)
        at org.apache.hadoop.util.Shell.execCommand(Shell.java:739)
        at org.apache.hadoop.util.Shell.execCommand(Shell.java:722)
        at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:633)
        at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:467)
        at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:456)
        at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:424)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:906)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:887)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:849)
        at org.apache.hadoop.fs.FileSystem.createNewFile(FileSystem.java:1149)
        at org.apache.nutch.util.LockUtil.createLockFile(LockUtil.java:58)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:357)
        at org.apache.nutch.crawl.Injector.run(Injector.java:467)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.nutch.crawl.Injector.main(Injector.java:441)

Looking through the source, here are lines 440 through 443 of org.apache.nutch.crawl.Injector:

  public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(NutchConfiguration.create(), new Injector(), args);
    System.exit(res);
  }

It's not clear whether NutchConfiguration.create() or new Injector() is failing there. I set up my installation by following the tutorial on the Nutch site. I put a list of 3 URLs, one per line, in the file ./urls/seed.txt and edited ./conf/nutch-site.xml.

Any suggestions for investigation/debugging this would be appreciated. Thank you!


Solution

  • After some struggling, here are the final steps to get Hadoop working with Cygwin on Windows.

    1. Download the winutils.exe and hadoop.dll that match your Hadoop version from https://github.com/cdarlint/winutils and place both files in a folder named bin.

    2. Set HADOOP_HOME to the parent directory of that bin folder. (For example, if the two files were downloaded to D:\winutil\bin, then HADOOP_HOME = D:\winutil.)

    3. Add D:\winutil\bin to the Windows PATH variable. This step is required now (it was not a while back).
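
As a rough sketch, steps 2 and 3 could also be applied from the Cygwin shell for the current session instead of through the Windows system settings. The trace in the question goes through org.apache.hadoop.util.Shell, which is consistent with Hadoop not finding winutils.exe, so it may help to confirm both files are in place before re-running bin/nutch. The function name check_hadoop_home is hypothetical, and the D:\winutil location (seen from Cygwin as /cygdrive/d/winutil) is only the example directory from step 2; adjust both to your setup.

```shell
# Hypothetical helper: report which of the two required files are missing
# under "<dir>/bin". Prints nothing when both are present.
check_hadoop_home() {
    dir="$1"
    for f in winutils.exe hadoop.dll; do
        [ -f "$dir/bin/$f" ] || echo "missing: $dir/bin/$f"
    done
}

# Assumption: the files were downloaded to D:\winutil\bin as in step 2.
# HADOOP_HOME keeps the Windows-style path because the Java process reads it.
export HADOOP_HOME='D:\winutil'

# Cygwin-style spelling of the same bin folder for the shell's own PATH.
export PATH="$PATH:/cygdrive/d/winutil/bin"

# Lists anything still missing; no output means both files were found.
check_hadoop_home /cygdrive/d/winutil
```

If the check prints nothing, re-running `bin/nutch inject crawl/crawldb urls` from the same shell session is a reasonable next test. Variables exported this way last only for that session.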