Search code examples
hadoophivehdfshadoop-partitioning

HDFS:Exact meaning of dfs.block.size


In our cluster the dfs.block.size is configured 128M, but I have seen quite a few files which is of the size of 68.8M which is a weird size. I have been confused on how exactly this configuration option affects how files look like on HDFS.

  • First thing I wish to make sure is that, will ideally files all of the size of the block size that already configured? Here I mean ideally file and block in a one-on-one mapping
  • If the files are not inherently small but are generated by MR jobs, what can be the possible cause of these small files?
  • One more point to add is that we are using the hive dynamic partitioning function which I am not sure if is one source of the problems. For the source of small files I have checked this blog but it The small files Problem

But the situations don't really match mine which makes my confusion remains. Hope anyone could give me some insight on that. Thanks a lot in advandce.


Solution

  • Files can be smaller than block, in this case it does not occupy the whole block size in filesystem. Read this answer: https://stackoverflow.com/a/14109147/2700344

    If you are using Hive with dynamic partition load, small files are often produced by reducers which are writing many partitions each.

    insert overwrite table mytable partition(event_date)
    select col1, col2, event_date 
     from some_table;
    

    For example if you running above command and there are totally 200 reducers on the last step and 20 different event_date partitions, then each reducer will create file in each partition. It will result in 200x20=4000 files.

    Why is it happens? Because the data is being distributed randomly between reducers, each reducer receives all partitions data and creates files in every partition.

    If you add distribute by partition key

    insert overwrite table mytable partition(event_date)
    select col1, col2, event_date 
     from some_table
    distribute by event_date;
    

    Then previous mapper step will group data according to distribute by, and reducers will receive the whole partition file and will create single file in each partition folder.

    You may add something else to the distribute by to create more files (and run more reducers for better parallelism). Read these related answers: https://stackoverflow.com/a/59890609/2700344, https://stackoverflow.com/a/38475807/2700344, Specify minimum number of generated files from Hive insert