
Difference in HDFS data size and Hive Data Size


I have a table in Hive.

When I run the command show tblproperties myTableName, it gives the result below:

numFiles        12
numRows         1688092
rawDataSize     934923162
totalSize       936611254

That means rawDataSize is 934.92 MB and totalSize is 936.61 MB (taking 1 MB as 1,000,000 bytes).

I then ran the following command to calculate the data size at the HDFS location of the same table:

[user@server1 ~]$ hdfs dfs -du -h -s /apps/hive/warehouse/test.db/myTableName
893.2 M  /apps/hive/warehouse/test.db/myTableName

The reported data size is 893.2 M.

There is a big difference between the two sizes for the same table. I am trying to understand why they differ and am looking for a detailed explanation.

Table Type - MANAGED_TABLE

# Storage Information

SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:            org.apache.hadoop.mapred.TextInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed:             No
Num Buckets:            -1

Solution

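  • The two numbers describe the same bytes; only the unit differs. hdfs dfs -du -h prints sizes in binary units (1 M = 1024 × 1024 bytes), while the 936.61 MB figure in the question divides by 1,000,000.
  • Converting totalSize the same way: 936611254 / 1024 / 1024 = 893.2 M, which matches the HDFS output.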
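
As a quick check of that arithmetic, here is a minimal Python sketch that converts the totalSize value from the question into both decimal and binary megabytes. The helper names to_decimal_mb and to_binary_mib are made up for this illustration and are not part of Hive or Hadoop.

```python
# Illustrative helpers (hypothetical names, not a Hive/Hadoop API).

def to_decimal_mb(num_bytes: int) -> float:
    """Convert bytes to decimal megabytes (1 MB = 1,000,000 bytes)."""
    return num_bytes / 1000 / 1000


def to_binary_mib(num_bytes: int) -> float:
    """Convert bytes to binary mebibytes (1 MiB = 1024 * 1024 bytes)."""
    return num_bytes / 1024 / 1024


# totalSize as reported by "show tblproperties" in the question
total_size = 936611254

print(f"decimal: {to_decimal_mb(total_size):.2f} MB")   # decimal: 936.61 MB
print(f"binary:  {to_binary_mib(total_size):.1f} MiB")  # binary:  893.2 MiB
```

The second line reproduces the 893.2 M shown by hdfs dfs -du -h, so totalSize and the HDFS figure agree once the same unit is used.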