hadoop hive hdfs impala

How does Hive still read data after it has been deleted from HDFS?


I have an external table in Hive that points to an HDFS location. By mistake, I ran the job that loads the data into HDFS twice.

Even after I deleted the duplicate file from HDFS, Hive still reports double the row count (i.e. it still includes the rows from the deleted duplicate file).

select count(*) from tbl_name -- returns double the actual count

But

select count(col_name) from tbl_name -- returns the actual count.

When I queried the same table from Impala after running

INVALIDATE METADATA

I saw only the count of the data actually present in HDFS (without the duplicates).

How can Hive return double the count even after the file has been deleted from its physical location (HDFS)? Does it read the count from statistics?


Solution

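  • Hive uses statistics to compute count(*). You deleted the files manually (not through Hive), which is why the stats are wrong. count(col_name) presumably could not be answered from the stored row count alone, so it scanned the files and returned the actual count.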

    The solution is:

    1. Switch off statistics usage in such cases:

      set hive.compute.query.using.stats=false;

    2. Re-analyze the table, as you mention in your comment:

      analyze table tbl_name partition(a,b,c) compute statistics;
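
The two options above can be tried together in one Hive session. This is only a sketch: it assumes tbl_name is unpartitioned (for a partitioned table, keep the partition(a,b,c) clause from the answer), and the exact layout of the DESCRIBE FORMATTED output may differ between Hive versions.

```sql
-- Inspect the stored table-level statistics; look for numRows
-- under "Table Parameters" to see the stale row count.
DESCRIBE FORMATTED tbl_name;

-- Option 1: bypass statistics for this session so count(*) scans the files.
SET hive.compute.query.using.stats=false;
SELECT count(*) FROM tbl_name;

-- Option 2: recompute statistics so they match the files now in HDFS.
-- Assumption: unpartitioned table, so no PARTITION clause is given.
ANALYZE TABLE tbl_name COMPUTE STATISTICS;

-- Re-enable stats-based answers and check the count again.
SET hive.compute.query.using.stats=true;
SELECT count(*) FROM tbl_name;
```

Option 1 only affects the current session and makes count(*) slower on large tables, since it has to read the data. Option 2 fixes the stored statistics, so it is the better choice if the table will keep being queried with statistics enabled.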