solr hdfs cloudera flume

Index the data in DFS


I have loaded the data into HDFS using the command hadoop fs -put. The data is a set of rich documents such as PDF, DOC and text files. How can I index this data so that I can query it in Solr?


Solution

  • Use Apache Tika. It was created for extracting text and metadata from rich file formats such as PDF or DOC. Solr ships with the Tika jar included, so all you need to do is have a quick look at the instructions for using the jar as a command-line utility and you're good to go: http://tika.apache.org/1.5/gettingstarted.html
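
As a rough illustration of the command-line route, here is a minimal Python sketch that copies one file out of HDFS and runs the Tika app jar over it to get plain text. The jar location (tika-app-1.5.jar), the /tmp working directory and the helper names are assumptions made for the example, not anything from your setup. It also assumes hadoop and java are on the PATH.

```python
import subprocess
from pathlib import PurePosixPath

# Assumption: location of the Tika app jar; adjust to wherever it lives on your machine.
TIKA_JAR = "tika-app-1.5.jar"


def fetch_command(hdfs_path, local_dir):
    """Build the command that copies one file from HDFS to a local directory."""
    return ["hadoop", "fs", "-get", hdfs_path, local_dir]


def tika_text_command(local_file, tika_jar=TIKA_JAR):
    """Build the command that asks the Tika app to print the plain text of a file."""
    return ["java", "-jar", tika_jar, "--text", local_file]


def local_copy_path(hdfs_path, local_dir):
    """Work out where 'hadoop fs -get' will place the file locally."""
    return str(PurePosixPath(local_dir) / PurePosixPath(hdfs_path).name)


def extract_text(hdfs_path, local_dir="/tmp"):
    """Copy the file out of HDFS, run Tika on it and return the extracted text."""
    subprocess.run(fetch_command(hdfs_path, local_dir), check=True)
    result = subprocess.run(
        tika_text_command(local_copy_path(hdfs_path, local_dir)),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

Calling extract_text("/user/me/docs/report.pdf") (a hypothetical path) would return the document text, which you could then send to Solr as the content of a field. Replacing --text with --metadata in the Tika command prints the document metadata instead.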