hadoop indexing solr cloudera solrcloud

How to index all csv files in a directory with Solr?


I have a directory with hundreds of tab-delimited CSV files, none of which has a header in the first row, so the column names have to be specified by other means. The files may be located on a local disk or on HDFS.

What is the most efficient way to index these files?


Solution

  • If you have a lot of files, I think there are several ways to improve indexing speed:

    First, if your data is on a local disk, you can build the index with multiple threads. Note that each thread needs its own index output directory. When the threads finish, merge the per-thread indexes into a single index, which improves search speed.

    Second, if your data is on HDFS, I think using Hadoop MapReduce to build the index is very powerful. In addition, some UDF plugins for Pig or Hive can also build an index easily, but you first need to load your data into a Hive table or define a Pig schema, which is simple.

    Third, to better understand the methods above, you may want to read "How to make indexing faster".
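
As a rough illustration of the local-disk case, here is a minimal Python sketch. It does not build separate per-thread indexes and merge them as described above; it assumes the simpler route of posting the files concurrently to Solr's CSV update handler, where the `separator`, `header` and `fieldnames` parameters cover tab-delimited files without a header row. The Solr base URL, the collection name, the field names and the helper function names are all hypothetical placeholders.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_update_url(solr_base, collection, fieldnames, commit=False):
    """Build a CSV update URL for header-less, tab-delimited input."""
    params = {
        "separator": "\t",                   # tab-delimited; encoded as %09
        "header": "false",                   # first row is data, not column names
        "fieldnames": ",".join(fieldnames),  # column names are supplied here instead
        "commit": "true" if commit else "false",
    }
    return f"{solr_base.rstrip('/')}/{collection}/update?{urlencode(params)}"


def post_csv(url, path):
    """Send one file to Solr and return the HTTP status code.

    The whole file is read into memory, which is fine for a sketch but
    may not suit very large files.
    """
    request = Request(
        url,
        data=Path(path).read_bytes(),
        headers={"Content-Type": "application/csv"},
        method="POST",
    )
    with urlopen(request) as response:
        return response.status


def index_directory(directory, url, post=post_csv, workers=4):
    """Post every *.csv file in `directory` concurrently.

    Returns a dict mapping file name to whatever `post` returned.
    """
    files = sorted(Path(directory).glob("*.csv"))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda f: post(url, f), files))
    return {f.name: r for f, r in zip(files, results)}


if __name__ == "__main__":
    # Hypothetical values: adjust to your own Solr instance and schema.
    url = build_update_url("http://localhost:8983/solr", "mycollection",
                           ["id", "name", "price"])
    print(index_directory("/path/to/csv/files", url))
```

Commits are left off per file here on the assumption that a single commit after all files have been posted is cheaper than committing after each one. The number of workers is a guess and would need tuning against your own hardware.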