I have configured Nutch 2.3.1 with the complete Hadoop/HBase ecosystem on a small cluster. I am curious about the scoring algorithm used in Nutch. I found and enabled the OPIC scoring filter. To see its impact, I checked the score at different steps (in the dbupdate and generate phases), as suggested in the Nutch wiki. However, every document's score always remains zero, no matter how many iterations I run or how many documents I fetch. Is there a problem in the OPIC implementation, or am I missing some configuration?
I have observed that the _csh_ field, which contains the cash, is removed from the corresponding table in HBase during the fetcher phase.
I resolved it by making the following change in OPICScoringFilter.java:
src/plugin/scoring-opic/src/java/org/apache/nutch/scoring/opic/OPICScoringFilter.java
Instead of storing the cash in the row's metadata as raw bytes, I now store it in the markers as a Utf8 string.
- row.getMetadata().put(CASH_KEY, ByteBuffer.wrap(Bytes.toBytes(score)));
+ row.getMarkers().put(CASH_KEY, new Utf8(Double.toString(score)));
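A change like this only helps if the code that reads the cash back looks in the same place and parses the same format. Here is a minimal sketch of that round trip in plain Java. It is not the actual Nutch code: a HashMap of strings stands in for the row's markers (the diff above stores an Avro Utf8 value), and the class CashMarkerSketch and the helpers writeCash and readCash are hypothetical names used only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class CashMarkerSketch {
    // Name of the cash field as seen in HBase (from the post above).
    static final String CASH_KEY = "_csh_";

    // Write side: store the cash as text, mirroring Double.toString(score) in the diff.
    // "markers" is a stand-in for row.getMarkers(); hypothetical helper.
    static void writeCash(Map<String, String> markers, float score) {
        markers.put(CASH_KEY, Double.toString(score));
    }

    // Read side: look in the same map and parse the text back to a float.
    // Falls back to 0 when the field is absent (an assumption for this sketch).
    static float readCash(Map<String, String> markers) {
        String raw = markers.get(CASH_KEY);
        return raw == null ? 0.0f : Float.parseFloat(raw);
    }

    public static void main(String[] args) {
        Map<String, String> markers = new HashMap<>();
        System.out.println(readCash(markers)); // field absent, falls back to zero
        writeCash(markers, 0.25f);
        System.out.println(readCash(markers)); // value survives the round trip
    }
}
```

If the write side is moved to the markers but a reader still looks in the metadata and decodes bytes, it will not find the value, so both sides need to be changed together.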