Tags: hadoop, hbase, hdfs, flume, bigdata

What is the best place to store multiple small files in Hadoop?


I will have multiple small text files of around 10 KB each, and I am not sure whether to store them in HBase or in HDFS. Which would be the more efficient storage? To store them in HBase I need to parse each file first and then save it against some row key, whereas in HDFS I can simply create a path and save the file at that location. But everything I have read so far says you should avoid having many small files and create fewer big files instead. I cannot merge those files, so I can't create big files out of the small ones.

Kindly suggest.


Solution

  • A large number of small files doesn't fit well with Hadoop, since each file occupies at least one HDFS block and each block requires one mapper to process it by default.

    There are several options/strategies to minimize the impact of small files; all of them require processing the small files at least once and "packaging" them into a better format. If you plan to read these files several times, pre-processing them can make sense, but if you will use them only once it doesn't matter.

    To process small files, my suggestion is to use CombineTextInputFormat (here is an example): https://github.com/lalosam/HadoopInExamples/blob/master/src/main/java/rojosam/hadoop/CombinedInputWordCount/DriverCIPWC.java

    CombineTextInputFormat uses one mapper to process several files, but it may require transferring files to a different DataNode so they can be grouped on the DataNode where the map is running, and it can perform badly with speculative tasks, although you can disable those if your cluster is stable enough.
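
    Below is a minimal word-count driver sketch showing how CombineTextInputFormat can be wired in. The class names (CombineSmallFilesWordCount, TokenMapper, SumReducer) and the 128 MB split-size cap are assumptions for illustration, not taken from the linked example:

    ```java
    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class CombineSmallFilesWordCount {

        // Standard word-count mapper; the interesting part is the input format set in main().
        public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer it = new StringTokenizer(value.toString());
                while (it.hasMoreTokens()) {
                    word.set(it.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "combine small files word count");
            job.setJarByClass(CombineSmallFilesWordCount.class);

            // One mapper now processes many small files instead of one file each.
            job.setInputFormatClass(CombineTextInputFormat.class);
            // Cap each combined split at 128 MB (value in bytes); assumed value, tune for your cluster.
            CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);

            job.setMapperClass(TokenMapper.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }
    ```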

    Alternatives to repackage small files are:

    1. Create sequence files where each record contains one of the small files. With this option you keep the original files (a sketch of this appears at the end of this answer).
    2. Use IdentityMapper and IdentityReducer where the number of reducers is less than the number of files. This is the easiest approach, but it requires that every line in the files be equal and independent (no headers or metadata at the beginning of a file that are needed to understand the rest of it).
    3. Create an external table in Hive and then insert all the records from this table into a new table (INSERT INTO . . . SELECT FROM . . .). This approach has the same limitations as option two and requires Hive; the advantage is that you don't need to write a MapReduce job.

    If you cannot merge files as in option 2 or 3, my suggestion is to go with option 1.
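
    As a sketch of option 1, the following standalone program packs every small file from an HDFS directory into one SequenceFile, using the file name as the key and the raw bytes as the value. The class name and the choice of BLOCK compression are assumptions for illustration:

    ```java
    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SmallFilesToSequenceFile {
        public static void main(String[] args) throws IOException {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path inputDir = new Path(args[0]);   // directory holding the small files
            Path outputFile = new Path(args[1]); // target sequence file

            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(outputFile),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(BytesWritable.class),
                    SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK))) {

                for (FileStatus status : fs.listStatus(inputDir)) {
                    if (status.isDirectory()) continue;
                    // Key = original file name, value = full file contents, so the
                    // original files can be reconstructed later if needed.
                    // Reading the whole file into memory is fine for ~10 KB files.
                    byte[] contents = new byte[(int) status.getLen()];
                    try (FSDataInputStream in = fs.open(status.getPath())) {
                        in.readFully(0, contents);
                    }
                    writer.append(new Text(status.getPath().getName()),
                                  new BytesWritable(contents));
                }
            }
        }
    }
    ```

    A later MapReduce job can read the packed data with SequenceFileInputFormat, and the original files can be re-materialized from the key/value pairs if they are ever needed again.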