hadoop hive replication

Total number of replicas after copying an HDFS file into a Hive table


Suppose I load a file that is in HDFS into a Hive table. What is the total number of replicas of that file? In HDFS the file is replicated 3 times. Does copying it into a Hive table create additional replicas, for a total of 6, or not?


Solution

  • In HDFS, the number of replicas is determined by the configured replication factor. In your case the replication factor is 3, so there are three replicas of the file.

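    If you copy the data into an internal (managed) Hive table while keeping the original file, for example with an INSERT ... SELECT from another table defined over that file, the data is written once more into the table's directory on HDFS. (Sqoop itself imports from a relational database, not from HDFS.) That new copy is then replicated according to the same replication factor.

    In total you end up with 3 (original file) + 1 (Hive copy) * 3 => 3 replicas of the original file and 3 replicas of the data stored by Hive. On disk that is 6 block replicas, but they are not six identical copies of one file, since Hive may store the table data in a different file format.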

    OR

    If you do a LOAD DATA INPATH into an internal table, the file is moved into the table's directory rather than copied. The file no longer exists at its old location and only the Hive copy remains, so you end up with just the Hive table data (and its replicas).

    In your case, 3 replicas of the Hive table data (as the replication factor is set to 3).

    OR

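    If you create an external table, no new copy of the data is created. Hive only records metadata about the data (schema and location) in its metastore. So you end up with your HDFS replicas plus the table's metadata.

    In your case, 3 replicas in HDFS plus the table metadata in the Hive metastore. The metastore is normally kept in a relational database, so that metadata is not replicated by HDFS.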
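
The arithmetic behind the three cases above can be sketched in a few lines of Python. This is only an illustrative model, not anything Hive or HDFS provides: the `Scenario` class, its fields, and `block_replicas` are hypothetical names, and the sketch assumes every data copy uses the same replication factor and ignores metastore metadata.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """Hypothetical description of one way of getting HDFS data into Hive."""
    name: str
    keeps_original: bool      # does the source file still exist at its old path?
    creates_table_data: bool  # does the table directory hold its own data files?


def block_replicas(scenario: Scenario, replication_factor: int = 3) -> int:
    """Count data replicas on HDFS: logical copies times the replication factor."""
    logical_copies = int(scenario.keeps_original) + int(scenario.creates_table_data)
    return logical_copies * replication_factor


SCENARIOS = [
    Scenario("copy into managed table, original kept", True, True),
    Scenario("LOAD DATA INPATH into managed table", False, True),
    Scenario("external table over the existing file", True, False),
]

for s in SCENARIOS:
    print(f"{s.name}: {block_replicas(s)} replicas")
```

With a replication factor of 3 this gives 6, 3, and 3 replicas respectively, matching the three cases described in the answer.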