Search code examples
amazon-s3hiveparquet

Must run "analyze table" in hive else count(*) shows 0


I am creating an external Hive table on a parquet file on S3. The commands look like

CREATE EXTERNAL TABLE userinfo(
  user_id string,
  last_name string,
  first_name string
)
PARTITIONED BY (
  yr string,
  mo string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://mybucket/basedir/'
TBLPROPERTIES (
  'serialization.null.format'='');

alter table userinfo add IF NOT EXISTS partition (yr='2021', mo='07');

At this point, if I run "select count(*) from userinfo", I get 0 as the result. But if I then run

ANALYZE TABLE userinfo PARTITION(yr='2021', mo='07') COMPUTE STATISTICS;

and rerun the "select count(*)..." I get the expected row count.

This isn't a show-stopper, but it makes me think I'm doing something/failing to do something that's causing this strange behavior. Any insights are welcome.


Solution

  • You are not telling optimizer the statistics of the table so it shows 0 because it doesnt know anything about the data in it.
    Ideally you should gather statistics whenever you load data into a table/partition. Stale statistics or no statistics will keep optimizer thinking/guessing and thus making queries run longer/consume more memory and sometime return wrong values.
    Now, you can setup this property in hive for auto stats gathering -

    hive.stats.autogather=true
    

    Or you can gather them manually right after loading data into it.

    ANALYZE TABLE tab COMPUTE STATISTICS;