Tags: hadoop, amazon-s3, hive, minio, apache-tez

hive-on-tez mapper stuck in INITIALIZING with total number of containers being -1 when accessing data on S3/MinIO


I have a Hadoop + Hive + Tez setup built from scratch (meaning I deployed it component by component). Hive is configured to use Tez as its execution engine.

In its current state, Hive can access tables on HDFS, but it cannot access tables stored on MinIO (via the s3a filesystem implementation).

As the attached Tez UI screenshot shows, when executing SELECT COUNT(*) FROM s3_table,

  • Tez execution hangs forever
  • Map 1 stays in the INITIALIZING state
  • Map 1 always shows a total task count of -1 and a pending count of -1 (why -1?)

Things already checked:

  • Hadoop itself can access MinIO/S3 without issue. For example, hdfs dfs -ls s3a://bucketname works fine.
  • Hive-on-Tez can run queries against tables on HDFS, with mappers and reducers generated quickly and successfully.
  • Hive-on-MR can run queries against tables on MinIO/S3 without issue.
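For reference, the s3a settings used for the MinIO checks above live in core-site.xml and look roughly like this. The property names are the standard hadoop-aws ones; the endpoint and credentials are placeholders for my environment:

```xml
<!-- core-site.xml: s3a connector pointed at MinIO.
     Endpoint and credentials are placeholders; adjust to your deployment. -->
<property>
  <name>fs.s3a.endpoint</name>
  <value>http://minio-host:9000</value>
</property>
<property>
  <name>fs.s3a.access.key</name>
  <value>ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>SECRET_KEY</value>
</property>
<!-- MinIO serves buckets at the path level, not as virtual-host subdomains -->
<property>
  <name>fs.s3a.path.style.access</name>
  <value>true</value>
</property>
```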

What could be the possible causes for this problem?

Attaching a Tez UI screenshot.

Version information:

  • Hadoop 3.2.1
  • Hive 3.1.2
  • Tez 0.9.2
  • MinIO RELEASE.2020-01-25T02-50-51Z

Solution

  • It turned out that Tez's S3 support must be enabled explicitly at compile time. For Hadoop 2.8+, to enable S3 support, Tez must be compiled from source with the following command:

    mvn clean package -DskipTests=true -Dmaven.javadoc.skip=true -Paws -Phadoop28 -P\!hadoop27
    

    After that, upload the generated tez-x.y.z.tar.gz to HDFS and extract tez-x.y.z-minimal.tar.gz to $TEZ_LIB_DIR. That fixed it for me: Hive queries against MinIO/S3 now run smoothly.
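    For completeness, tez-site.xml then has to point at the tarball uploaded to HDFS. The /apps/tez path below is just the directory I chose; only the tez.lib.uris property name is standard:

```xml
<!-- tez-site.xml: tell Tez where the runtime tarball lives on HDFS.
     /apps/tez is an arbitrary directory; adjust to your layout and version. -->
<property>
  <name>tez.lib.uris</name>
  <value>hdfs:///apps/tez/tez-0.9.2.tar.gz</value>
</property>
```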

    However, the Tez installation guide doesn't mention anything about enabling S3 support, and the default Tez binary releases are not built with S3 or Azure support.

    The (hopefully) complete set of build options and pitfalls is actually documented in BUILDING.txt, which says:

    However, to build against hadoop versions higher than 2.7.0, you will need to do the following:

    For Hadoop version X where X >= 2.8.0

    $ mvn package  -Dhadoop.version=${X} -Phadoop28 -P\!hadoop27
    

    For recent versions of Hadoop (which do not bundle aws and azure by default), you can bundle AWS-S3 (2.7.0+) or Azure (2.7.0+) support:

    $ mvn package -Dhadoop.version=${X} -Paws -Pazure