parquet impala tez

Impala 2.7 fails to read any data from a parquet table created from Hive with Tez


I'm populating a partitioned Hive table in Parquet storage format using a query that contains a number of union all operators. The query is executed using Tez, which with default settings results in multiple concurrent Tez writers creating an HDFS structure in which the Parquet files sit in subfolders (named after the Tez writer ID) under the partition folders, e.g. /apps/hive/warehouse/scratch.db/test_table/part=p1/8/000000_0

Even after running invalidate metadata and computing stats on the table, Impala returns zero rows when the table is queried. The issue seems to be that Impala does not traverse into the partition subfolders to look for Parquet files.

If I set hive.merge.tezfiles to true (it's false by default), effectively forcing Tez to use an extra processing step to coalesce multiple files into one, the resulting Parquet files are written directly under the partition folder, and after a refresh Impala can see the data in the new or updated partitions.
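
The workaround described above might look roughly like the following sketch. The table name scratch.test_table and the partition column part are inferred from the example path; source_a, source_b and the column names are hypothetical placeholders for the real union all inputs.

```sql
-- Hive session (running on Tez): ask Hive to merge the per-writer output
-- files in an extra step.
SET hive.merge.tezfiles=true;

-- Hypothetical stand-in for the real query with UNION ALL branches.
INSERT OVERWRITE TABLE scratch.test_table PARTITION (part = 'p1')
SELECT col1, col2 FROM scratch.source_a
UNION ALL
SELECT col1, col2 FROM scratch.source_b;

-- Impala session: reload the file listing for the existing table.
REFRESH scratch.test_table;
```

The first two statements run in Hive and the last one in Impala. The extra merge step adds processing time to the insert.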

I wonder if there is a config option for Impala to instruct it to look in partition subfolders, or perhaps there is a patch for Impala that changes its behavior in that regard.


Solution

  • As of now, recursive reading of files from subdirectories under the table location is not supported in Impala. For example, if a table is created with location '/home/data/input/'

    and the directory structure is as follows:

        /home/data/input/a.txt
        /home/data/input/b.txt
        /home/data/input/subdir1/x.txt
        /home/data/input/subdir2/y.txt
    

    then Impala can query the following files only:

        /home/data/input/a.txt
        /home/data/input/b.txt

    The following files are not queried:

        /home/data/input/subdir1/x.txt
        /home/data/input/subdir2/y.txt
    

    As an alternative solution, you can read the data in Hive and insert it into a final Hive table.

    Then create an Impala view on top of this final table for interactive or reporting queries.

    Unlike Impala, Hive can scan subdirectories. Enable this in the Hive session with the following configuration settings:

    SET mapred.input.dir.recursive=true;

    and

    SET hive.mapred.supports.subdirectories=true;
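
Put together, the copy-then-view alternative could be sketched as follows. The names scratch.test_table_final and scratch.test_table_v are hypothetical, scratch.test_table and the partition column part are taken from the path in the question, and the dynamic-partition settings are an assumption about how the copy would be written rather than part of the original answer.

```sql
-- Hive session: let Hive read files in subdirectories of each partition.
SET mapred.input.dir.recursive=true;
SET hive.mapred.supports.subdirectories=true;
-- Assumption: use dynamic partitioning so every partition is copied at once.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Hypothetical final table with the same columns, partitioning and format.
CREATE TABLE IF NOT EXISTS scratch.test_table_final LIKE scratch.test_table;

-- SELECT * returns the partition column last, as dynamic partitioning expects.
INSERT OVERWRITE TABLE scratch.test_table_final PARTITION (part)
SELECT * FROM scratch.test_table;

-- Impala session: make the new table visible, then expose it through a view.
INVALIDATE METADATA scratch.test_table_final;
CREATE VIEW IF NOT EXISTS scratch.test_table_v AS
SELECT * FROM scratch.test_table_final;
```

A plain SELECT without union all should write its files directly under the partition folders, but it is worth checking the resulting HDFS layout before pointing Impala at the final table.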