I have created an Apache Nutch indexer plugin that pushes data to Manticore Search using the Manticore Search Java API.
The build succeeds, and all crawl steps before indexing succeed (inject, generate, fetch, parse, updatedb).
I am running Nutch 1.19 in a Docker container.
When I run the indexing command
bin/nutch index /root/nutch_source/crawl/crawldb/ -linkdb /root/nutch_source/crawl/linkdb/ -dir /root/nutch_source/crawl/segments/ -filter -normalize -deleteGone
it fails, and logs/hadoop.log contains the following stack trace:
2021-09-07 10:15:46,040 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2021-09-07 10:16:23,666 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2021-09-07 10:17:36,020 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2021-09-07 10:17:36,378 INFO segment.SegmentChecker - Segment dir is complete: file:/root/nutch_source/crawl/segments/20210906001900.
2021-09-07 10:17:36,383 INFO segment.SegmentChecker - Segment dir is complete: file:/root/nutch_source/crawl/segments/20210906001655.
2021-09-07 10:17:36,387 INFO segment.SegmentChecker - Segment dir is complete: file:/root/nutch_source/crawl/segments/20210906002358.
2021-09-07 10:17:36,391 INFO indexer.IndexingJob - Indexer: starting at 2021-09-07 10:17:36
2021-09-07 10:17:36,401 INFO indexer.IndexingJob - Indexer: deleting gone documents: true
2021-09-07 10:17:36,402 INFO indexer.IndexingJob - Indexer: URL filtering: true
2021-09-07 10:17:36,402 INFO indexer.IndexingJob - Indexer: URL normalizing: true
2021-09-07 10:17:36,403 INFO indexer.IndexerMapReduce - IndexerMapReduce: crawldb: /root/nutch_source/crawl/crawldb
2021-09-07 10:17:36,407 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: file:/root/nutch_source/crawl/segments/20210906001900
2021-09-07 10:17:36,408 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: file:/root/nutch_source/crawl/segments/20210906001655
2021-09-07 10:17:36,410 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: file:/root/nutch_source/crawl/segments/20210906002358
2021-09-07 10:17:36,411 INFO indexer.IndexerMapReduce - IndexerMapReduce: linkdb: /root/nutch_source/crawl/linkdb
2021-09-07 10:17:36,528 WARN impl.MetricsConfig - Cannot locate configuration: tried hadoop-metrics2-jobtracker.properties,hadoop-metrics2.properties
2021-09-07 10:17:37,708 INFO mapreduce.Job - The url to track the job: http://localhost:8080/
2021-09-07 10:17:37,711 INFO mapreduce.Job - Running job: job_local250243852_0001
2021-09-07 10:17:38,724 INFO mapreduce.Job - Job job_local250243852_0001 running in uber mode : false
2021-09-07 10:17:38,725 INFO mapreduce.Job - map 0% reduce 0%
2021-09-07 10:17:39,731 INFO mapreduce.Job - map 100% reduce 0%
2021-09-07 10:17:47,677 WARN impl.MetricsSystemImpl - JobTracker metrics system already initialized!
2021-09-07 10:17:47,992 INFO indexer.IndexWriters - Index writer org.apache.nutch.indexwriter.manticore.ManticoreIndexWriter identified.
2021-09-07 10:17:48,013 WARN mapred.LocalJobRunner - job_local250243852_0001
java.lang.Exception: java.lang.NoClassDefFoundError: com/manticoresearch/client/ApiException
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:492)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:559)
Caused by: java.lang.NoClassDefFoundError: com/manticoresearch/client/ApiException
at java.base/java.lang.Class.getDeclaredConstructors0(Native Method)
at java.base/java.lang.Class.privateGetDeclaredConstructors(Class.java:3137)
at java.base/java.lang.Class.getConstructor0(Class.java:3342)
at java.base/java.lang.Class.getConstructor(Class.java:2151)
at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:170)
at org.apache.nutch.indexer.IndexWriters.<init>(IndexWriters.java:97)
at org.apache.nutch.indexer.IndexWriters.lambda$get$0(IndexWriters.java:60)
at java.base/java.util.Map.computeIfAbsent(Map.java:1003)
at org.apache.nutch.indexer.IndexWriters.get(IndexWriters.java:60)
at org.apache.nutch.indexer.IndexerOutputFormat.getRecordWriter(IndexerOutputFormat.java:41)
at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.<init>(ReduceTask.java:542)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:615)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:390)
at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:347)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.lang.ClassNotFoundException: com.manticoresearch.client.ApiException
at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:581)
at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
at org.apache.nutch.plugin.PluginClassLoader.loadClassFromSystem(PluginClassLoader.java:105)
at org.apache.nutch.plugin.PluginClassLoader.loadClassFromParent(PluginClassLoader.java:93)
at org.apache.nutch.plugin.PluginClassLoader.loadClass(PluginClassLoader.java:73)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
... 19 more
2021-09-07 10:17:48,742 INFO mapreduce.Job - Job job_local250243852_0001 failed with state FAILED due to: NA
2021-09-07 10:17:48,773 INFO mapreduce.Job - Counters: 30
File System Counters
FILE: Number of bytes read=157397439
FILE: Number of bytes written=332518016
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
Map-Reduce Framework
Map input records=51223
Map output records=51223
Map output bytes=24049558
Map output materialized bytes=24158915
Input split bytes=2010
Combine input records=0
Combine output records=0
Reduce input groups=0
Reduce shuffle bytes=24158915
Reduce input records=0
Reduce output records=0
Spilled Records=51223
Shuffled Maps =14
Failed Shuffles=0
Merged Map outputs=14
GC time elapsed (ms)=125
Total committed heap usage (bytes)=5221908480
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=11426452
File Output Format Counters
Bytes Written=0
2021-09-07 10:17:48,774 ERROR indexer.IndexingJob - Indexing job did not succeed, job status:FAILED, reason: NA
2021-09-07 10:17:48,776 ERROR indexer.IndexingJob - Indexer: java.lang.RuntimeException: Indexing job did not succeed, job status:FAILED, reason: NA
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:152)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:293)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:302)
I resolved this issue by declaring all of the Manticore Search client's dependent libraries in the plugin manifest, the plugin.xml file inside the plugin folder.
The dependent JAR libraries are all listed in the folder runtime/local/plugins/<plugin-name>/; I took each file name and added it under the <runtime> tag of plugin.xml.
After rebuilding, the indexer worked!
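For illustration, here is a rough sketch of what such a plugin.xml might look like. As far as I understand it, Nutch's plugin class loader only picks up JARs that are declared as <library> entries, which would explain why a JAR sitting in the plugin folder but missing from the manifest ends in NoClassDefFoundError. Only the ManticoreIndexWriter class name comes from the log above. The plugin id, the plugin JAR name, and every dependency JAR name below are placeholders I made up, so replace them with the actual file names found in runtime/local/plugins/<plugin-name>/.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: the id, name, and all JAR file names are hypothetical placeholders. -->
<plugin id="indexer-manticore" name="Manticore Index Writer"
        version="1.0.0" provider-name="nutch.org">

  <runtime>
    <!-- The plugin's own JAR (placeholder name), exported as in the bundled indexer plugins. -->
    <library name="indexer-manticore.jar">
      <export name="*"/>
    </library>

    <!-- One entry per dependency JAR present in the plugin folder. -->
    <!-- Placeholder names: copy the real file names, including versions. -->
    <library name="manticoresearch-X.Y.Z.jar"/>
    <library name="some-transitive-dependency-X.Y.Z.jar"/>
  </runtime>

  <requires>
    <import plugin="nutch-extensionpoints"/>
  </requires>

  <!-- The implementation class name is taken from the log output above. -->
  <extension id="org.apache.nutch.indexer.manticore"
             name="Manticore Index Writer"
             point="org.apache.nutch.indexer.IndexWriter">
    <implementation id="ManticoreIndexWriter"
                    class="org.apache.nutch.indexwriter.manticore.ManticoreIndexWriter"/>
  </extension>
</plugin>
```

If a different class turns up in a new NoClassDefFoundError after rebuilding, that usually means another transitive dependency JAR is still missing from the <runtime> list.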