Tags: apache-spark, hadoop, hive, hdfs, hadoop2

How to place HDFS file blocks with same / shared partitioning applied for different files / tables on same Data Node


I have two big tables partitioned by a date column. They are saved as Parquet files in HDFS. Every partition is split into 64 MB blocks, each replicated 3 times across the cluster machines. To optimize the join operation I want to place matching date partitions of the two tables on the same machines (any given join key value occurs in exactly one date partition).

In Spark there is a Partitioner object which can help distribute blocks of different RDDs across the cluster. It's pretty similar to what I need, but I'm afraid that after saving, these RDDs' file blocks may be shuffled around by the HDFS placement mechanism. To explain: an RDD is a Spark-level abstraction, while the DataFrame method saveAsTable(...) calls (I suppose) some lower-level functions that choose data nodes and replicate the data.
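To illustrate the Spark-level part of this: a Partitioner maps each key to a partition index deterministically, so two datasets partitioned with the same partitioner align partition-for-partition and can be joined without a full shuffle. A minimal pure-Python sketch of that idea (mimicking the non-negative hash-modulo scheme of Spark's HashPartitioner; not actual Spark code):

```python
def hash_partition(key, num_partitions):
    # Deterministic key -> partition mapping, analogous to Spark's
    # HashPartitioner (hash of the key, modulo partition count).
    # Python's % already yields a non-negative result for a positive modulus.
    return hash(key) % num_partitions

dates = ["2023-01-01", "2023-01-02", "2023-01-03"]
N = 8

# Two datasets using the same partitioner put equal keys in the same
# partition index, so their partitions line up for a co-partitioned join.
table_a = {d: hash_partition(d, N) for d in dates}
table_b = {d: hash_partition(d, N) for d in dates}
assert table_a == table_b
```

Note this only guarantees alignment of Spark partitions in memory; as discussed below, it says nothing about which HDFS data nodes the written blocks end up on.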

Can anyone help me find out whether the blocks of my tables are distributed the right way?


Solution

  • The answer to your question is that one cannot expressly control the placement of "like / similar" data blocks, in terms of partitioning, for logically related files / tables. That is, you cannot influence on which data nodes HDFS places the data blocks.

    These partitions / chunks of data may coincidentally reside on the same data nodes / workers (due to replication by HDFS).

    As an aside, with S3 such an approach does not work in any event, as the concept of data-locality optimization does not exist there.
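For completeness: while you cannot control placement, you can inspect it after the fact with `hdfs fsck <path> -files -blocks -locations`, which reports the replica locations of every block. A small sketch that extracts the DataNode IPs from such output and compares two files (the `DatanodeInfoWithStorage[...]` format and the sample excerpts below are illustrative assumptions, not output from a real cluster):

```python
import re

def datanodes(fsck_output):
    """Collect the DataNode IPs mentioned in
    `hdfs fsck <path> -files -blocks -locations` output, where replica
    locations appear as DatanodeInfoWithStorage[ip:port,...] entries."""
    return set(re.findall(r"DatanodeInfoWithStorage\[([\d.]+):\d+", fsck_output))

# Made-up fsck excerpts for one block of each table (for illustration only):
sample_a = ("blk_1073741825 len=67108864 repl=3 "
            "[DatanodeInfoWithStorage[10.0.0.1:9866,DS-1,DISK], "
            "DatanodeInfoWithStorage[10.0.0.2:9866,DS-2,DISK]]")
sample_b = ("blk_1073741901 len=67108864 repl=3 "
            "[DatanodeInfoWithStorage[10.0.0.2:9866,DS-3,DISK], "
            "DatanodeInfoWithStorage[10.0.0.3:9866,DS-4,DISK]]")

# Nodes that host replicas of blocks from both files, if any.
overlap = datanodes(sample_a) & datanodes(sample_b)
```

Any overlap you find this way is coincidental, per the answer above, and may change as blocks are re-replicated or rebalanced.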