Tags: hadoop, hive, hadoop-yarn, apache-tez

Apache Hive Not Returning YARN Application Results Correctly


I'm running a from-scratch cluster on AWS EC2. I have an external table (partitioned) defined with data on S3. I'm able to query this table and receive results to the console with a simple select * statement:

hive> set hive.execution.engine=tez;
hive> select * from external_table where partition_1='1' and partition_2='2';
<correct results returned>

Running a query that requires a Tez job on the YARN cluster, however, doesn't return the results to the console:

hive> set hive.execution.engine=tez;
hive> select count(*) from external_table where partition_1='1' and partition_2='2';
Status: Running (Executing on YARN cluster with App id application_1572972524483_0012)

OK
+------+
| _c0  |
+------+
+------+
No rows selected (8.902 seconds)

However, if I dig into the logs and the filesystem, I can find the results from that query:

(yarn.resourcemanager.log) org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=root     OPERATION=AM Released Container TARGET=SchedulerApp     RESULT=SUCCESS  APPID=application_1572972524483_0022      CONTAINERID=container_1572972524483_0022_01_000002      RESOURCE=<memory:1024, vCores:1>        QUEUENAME=default
(container_folder/syslog_attempt) [TezChild] |exec.FileSinkOperator|: New Final Path: FS file:/tmp/<REALLY LONG FILE PATH>/000000_0
[root #] cat /tmp/<REALLY LONG FILE PATH>/000000_0
SEQ"org.apache.hadoop.io.BytesWritableorg.apache.hadoop.io.Textl▒ꩇ1som}▒▒j¹▒    2060

2060 is the correct count for the partition.

Now, oddly enough, I'm able to get the results from the application if I insert overwrite directory on HDFS:

hive> set hive.execution.engine=tez;
hive> INSERT OVERWRITE DIRECTORY '/tmp/local_out' select count(*) from external_table where partition_1='1' and partition_2='2';
[root #] hdfs dfs -cat /tmp/local_out/000000_0
2060

However, attempting to insert overwrite local directory fails:

hive> set hive.execution.engine=tez;
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/local_out' select count(*) from external_table where partition_1='1' and partition_2='2';
[root #] cat /tmp/local_out/000000_0
cat: /tmp/local_out/000000_0: No such file or directory

If I cat the container result file for this query, it contains only the number, with no class names or special characters:

[root #] cat /tmp/<REALLY LONG FILE PATH>/000000_0
2060

The only out-of-place log message I can find comes from the YARN ResourceManager log:

(yarn.resourcemanager.log) INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=root     OPERATION=AM Released Container TARGET=SchedulerApp     RESULT=SUCCESS  APPID=application_1572972524483_0023      CONTAINERID=container_1572972524483_0023_01_000004      RESOURCE=<memory:1024, vCores:1>        QUEUENAME=default
(yarn.resourcemanager.log) WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=root     IP=NMIP   OPERATION=AM Released Container TARGET=Scheduler        RESULT=FAILURE    DESCRIPTION=Trying to release container not owned by app or with invalid id.    PERMISSIONS=Unauthorized access or invalid container    APPID=application_1572972524483_0023    CONTAINERID=container_1572972524483_0023_01_000004

My guess, based on the vaguest of impressions, is that there's a character-encoding problem when the results are written to the local filesystem (hence the special characters in the container result file), but that's really just a guess and I have no idea how to verify or tackle it. Any help is greatly appreciated!
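For what it's worth, the "special characters" in the container file are probably not an encoding problem: a file beginning with SEQ followed by Hadoop class names is the binary header of a Hadoop SequenceFile. A quick way to check is to read the first bytes of the file and look for the magic number. This is a minimal sketch (it assumes short class names, so each length-prefixed VInt fits in a single byte, which holds for the standard org.apache.hadoop.io classes):

```python
# Check whether a Hive result file is a Hadoop SequenceFile rather than
# mis-encoded text. SequenceFiles start with the magic bytes b"SEQ",
# then a one-byte format version, then the key and value class names
# stored as (VInt length, UTF-8 bytes) pairs.

def looks_like_sequencefile(data: bytes) -> bool:
    """True if the buffer starts with the SequenceFile magic number."""
    return data[:3] == b"SEQ"

def header_classes(data: bytes) -> list:
    """Return the key and value class names from a SequenceFile header.

    Sketch only: assumes each class-name length is a single-byte VInt
    (i.e. the name is shorter than 128 bytes).
    """
    pos = 4  # skip b"SEQ" plus the one-byte version
    names = []
    for _ in range(2):  # key class, then value class
        length = data[pos]
        pos += 1
        names.append(data[pos:pos + length].decode("utf-8"))
        pos += length
    return names

# Usage: read the first couple hundred bytes of the container output
# file (the real path is the long temp path from the logs) and inspect:
#   with open("000000_0", "rb") as f:
#       head = f.read(256)
#   print(looks_like_sequencefile(head), header_classes(head))
```

If this reports BytesWritable/Text key and value classes, the file is a well-formed SequenceFile and the question becomes why the fetch task never reads it back, not how it was encoded.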


Solution

  • Someone on the Apache Hive mailing list suggested this was caused by the YARN container writing its result files to the local filesystem of the machine it ran on instead of to HDFS. After some digging in the source code, I found that:

    mapreduce.framework.name=local
    

    which is the default in Hadoop 3.2.1, was causing the problem.

    Solved with:

    set mapreduce.framework.name=yarn
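    The set command above only applies to the current Hive session. To make the fix persist across sessions, the same property can be set cluster-wide in mapred-site.xml (a sketch, assuming a standard Hadoop configuration layout under $HADOOP_CONF_DIR):

    <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
    </property>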