Search code examples
hadoophivehiveqlavro

COUNT (*) returns outdated number of rows after INSERT INTO or TRUNCATE


I have a table with an avro schema where I insert tweets periodically.

CREATE TABLE tweets
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED as INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.url'='hdfs:///tweet_schema.avsc');

When I run

SELECT COUNT(*) FROM tweets;

I get only the count of the rows added by the last INSERT statement:

INSERT INTO TABLE tweets SELECT tweet FROM another_table;

Similarly, if I run

TRUNCATE TABLE tweets;

Querying for the count returns the table rows count before truncating, however, selecting everything returns nothing as expected.

Is this the expected behavior?

If it's indeed the expected behavior, what should I be doing to get the total number of rows?

I'm running Hive 0.13


Solution

  • It turned to be a bug, I reported it and got resolved in 0.14.0. You can watch the issue here HIVE-7213

    A workaround is to:

    set hive.compute.query.using.stats=false;