Given a directory with hundreds of tab-delimited CSV files, each of which has no header row, so the column names must be specified by other means. The files may be located on a local disk or on HDFS.
What is the most efficient way to index these files?
If you have a lot of files, there are several ways to improve indexing speed:
First, if your data is on a local disk, you can build the index with multiple threads, but pay attention: each thread must write to its own index output directory. Finally, merge the per-thread indexes into a single index so that search stays fast. A sketch of this approach follows.
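Here is a minimal Lucene sketch of that idea, assuming a hypothetical three-column schema (`id`, `name`, `value`) and hypothetical input/output paths; adjust both to your data. Each worker thread owns one `IndexWriter` on its own shard directory, and the shards are merged at the end with `IndexWriter.addIndexes`:

```java
import java.io.BufferedReader;
import java.nio.file.*;
import java.util.*;
import java.util.concurrent.*;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.store.*;

public class ParallelIndexer {
    // The files have no header row, so column names come from elsewhere.
    // This schema is a placeholder; replace it with your real column names.
    static final String[] COLUMNS = {"id", "name", "value"};

    public static void main(String[] args) throws Exception {
        Path inputDir = Paths.get("/data/csv");       // hypothetical input location
        Path outputBase = Paths.get("/data/indexes"); // one sub-index per thread
        int threads = Runtime.getRuntime().availableProcessors();

        List<Path> files = new ArrayList<>();
        try (DirectoryStream<Path> ds = Files.newDirectoryStream(inputDir, "*.csv")) {
            ds.forEach(files::add);
        }

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        Path[] shardDirs = new Path[threads];
        for (int t = 0; t < threads; t++) {
            final int shard = t;
            shardDirs[t] = outputBase.resolve("shard-" + t);
            pool.submit(() -> {
                // Each thread writes to its own index directory.
                try (IndexWriter writer = new IndexWriter(
                        FSDirectory.open(shardDirs[shard]),
                        new IndexWriterConfig(new StandardAnalyzer()))) {
                    // Round-robin partition of the file list across threads.
                    for (int i = shard; i < files.size(); i += threads) {
                        indexFile(writer, files.get(i));
                    }
                }
                return null;
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);

        // Merge the per-thread indexes into one final index.
        try (IndexWriter merged = new IndexWriter(
                FSDirectory.open(outputBase.resolve("final")),
                new IndexWriterConfig(new StandardAnalyzer()))) {
            Directory[] shards = new Directory[threads];
            for (int t = 0; t < threads; t++) {
                shards[t] = FSDirectory.open(shardDirs[t]);
            }
            merged.addIndexes(shards);
        }
    }

    static void indexFile(IndexWriter writer, Path file) throws Exception {
        try (BufferedReader r = Files.newBufferedReader(file)) {
            String line;
            while ((line = r.readLine()) != null) {
                String[] fields = line.split("\t", -1); // tab-delimited, no header
                Document doc = new Document();
                for (int c = 0; c < COLUMNS.length && c < fields.length; c++) {
                    doc.add(new TextField(COLUMNS[c], fields[c], Field.Store.YES));
                }
                writer.addDocument(doc);
            }
        }
    }
}
```

Note that newer Lucene `IndexWriter`s are also thread-safe, so sharing a single writer across threads is another option; the per-shard layout above just matches the merge-at-the-end approach described here.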
Second, if your data is on HDFS, Hadoop MapReduce is a very powerful way to build the index: each mapper parses a split of the input in parallel. In addition, some UDF plugins for Pig and Hive can also build indexes easily, but you first need to load your data into a Hive table or declare a Pig schema, which is simple. A parsing-side sketch follows.
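As a rough illustration, assuming the same hypothetical three-column schema, a mapper could turn each tab-delimited line into named fields; the actual index segments would then be written in the reducer or through a custom OutputFormat, which is omitted here:

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper: parses one tab-delimited line per call and emits
// the named fields. A real indexing job would write Lucene segments in
// the reducer or via a custom OutputFormat.
public class CsvParseMapper extends Mapper<LongWritable, Text, Text, Text> {
    // Placeholder schema; the files have no header row.
    private static final String[] COLUMNS = {"id", "name", "value"};

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split("\t", -1);
        StringBuilder doc = new StringBuilder();
        for (int c = 0; c < COLUMNS.length && c < fields.length; c++) {
            doc.append(COLUMNS[c]).append('=').append(fields[c]).append('\n');
        }
        // Keyed by file offset for simplicity; a real job would use a
        // stable document id taken from the data itself.
        ctx.write(new Text(String.valueOf(offset.get())), new Text(doc.toString()));
    }
}
```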
Third, to better understand the methods above, you may want to read How to make indexing faster.