
How to add additional fields to Solr when indexing from Nutch?


I am using Nutch 1.9 under Cygwin with Solr 4.8.0. I can index the crawled data into Solr using the command below.

bin/crawl urls/ crawlresult/ http://localhost:8983/solr/ 1

But I want to add some additional fields while indexing, such as indexed_by, crawled_by, crawl_name, etc.
I need help with this.

Thanks in advance.


Solution

  • If the values of the additional fields do not change, you can use Nutch's index-static plugin, which lets you add a number of fields along with their contents. First enable the plugin in nutch-site.xml (by adding it to the plugin.includes property), then add the list of fields as shown below:

    <property>
     <name>index.static</name>
     <value>indexed_by:solr,crawled_by:nutch-1.8,crawl_name:nutch</value>
     <description>
      Used by plugin index-static to add fields with static data at indexing time.
      You can specify a comma-separated list of fieldname:fieldcontent per Nutch job.
      Each fieldcontent can have multiple values separated by space, e.g.,
      field1:value1.1 value1.2 value1.3,field2:value2.1 value2.2 ...
      It can be useful when collections can't be created by URL patterns,
      like in subcollection, but on a job-basis.
     </description>
    </property>
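
Enabling the plugin means listing it in the plugin.includes property of nutch-site.xml. The following is a minimal sketch, assuming a plugin list close to a typical Nutch 1.x default. The other plugin names in the value are an assumption, so copy the value from your own nutch-default.xml and only add static to the index-(...) group:

```xml
<property>
  <name>plugin.includes</name>
  <!-- Assumed plugin list; the only relevant change is "static" inside the index-(...) group. -->
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor|static)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```

The field names used in index.static (here indexed_by, crawled_by, and crawl_name) most likely also need to be declared in the Solr schema, or be matched by a dynamic field, for Solr to accept the documents.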
    

  • If the values of these fields are not static, that is, not independent of the indexed documents, you will need to write an IndexingFilter plugin to set them. Have a look at the source of the index-static plugin to see how to implement your own.