I am running my hive query on EMR cluster that which is 25 nodes cluster and i have used r4.4xlarge in stances to run this .
When i run my query i get below error .
Job Commit failed with exception 'org.apache.hadoop.hive.ql.metadata.HiveException(java.io.IOException: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Not Found (Service: Amazon S3; Status Code: 404; Error Code: 404 Not Found; Request ID: FEAF40B78D086BEE; S3 Extended Request ID: yteHc4bRl1MrmVhqmnzm06rdzQNN8VcRwd4zqOa+rUY8m2HC2QTt9GoGR/Qu1wuJPILx4mchHRU=), S3 Extended Request ID: yteHc4bRl1MrmVhqmnzm06rdzQNN8VcRwd4zqOa+rUY8m2HC2QTt9GoGR/Qu1wuJPILx4mchHRU=)'
FAILED: Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.tez.TezTask
/mnt/var/lib/hadoop/steps/s-10YQZ5Z5PRUVJ/./hive-script:617:in `<main>': Error executing cmd: /usr/share/aws/emr/scripts/hive-script "--base-path" "s3://us-east-1.elasticmapreduce/libs/hive/" "--hive-versions" "latest" "--run-hive-script" "--args" "-f" "s3://205067-pcfp-app-stepfun-s3appbucket-qa/2019-02-22_App/d77a6a82-26f4-4f06-a1ea-e83677256a55/01/DeltaOutPut/processing/Scripts/script.sql" (RuntimeError)
Command exiting with ret '1'
I have tried settings all king of HIVE parameter combinations like below
emrfs-site fs.s3.consistent.retryPolicyType exponential
emrfs-site fs.s3.consistent.metadata.tableName EmrFSMetadataAlt
emrfs-site fs.s3.consistent.metadata.write.capacity 300
emrfs-site fs.s3.consistent.metadata.read.capacity 600
emrfs-site fs.s3.consistent true
hive-site hive.exec.stagingdir .hive-staging
hive-site hive.tez.java.opts -Xmx47364m
hive-site hive.stats.fetch.column.stats true
hive-site hive.stats.fetch.partition.stats true
hive-site hive.vectorized.execution.enabled false
hive-site hive.vectorized.execution.reduce.enabled false
hive-site tez.am.resource.memory.mb 15000
hive-site hive.auto.convert.join false
hive-site hive.compute.query.using.stats true
hive-site hive.cbo.enable true
hive-site tez.task.resource.memory.mb 16000
But every time it failed . I tried increasing the no of nodes/bigger instances in the EMR cluster but result is still same .
I also tried with and without Tez but still did not worked for me .
Here is my sample query .I am just copying the part of my query
insert into filediffPcfp.TableDelta
Select rgt.FILLER1,rgt.DUNSNUMBER,rgt.BUSINESSNAME,rgt.TRADESTYLENAME,rgt.REGISTEREDADDRESSINDICATOR
Please help me identify the issue .
Adding full yarn logs
2019-02-26 06:28:54,318 [INFO] [TezChild] |exec.FileSinkOperator|: Final Path: FS s3://205067-pcfp-app-stepfun-s3appbucket-qa/2019-02-26_App/d996dfaa-1a62-4062-9350-d0c2bd62e867/01/DeltaOutPut/processing/Delta/.hive-staging_hive_2019-02-26_06-15-00_804_541842212852799084-1/_tmp.-ext-10000/000000_1
2019-02-26 06:28:54,319 [INFO] [TezChild] |exec.FileSinkOperator|: Writing to temp file: FS s3://205067-pcfp-app-stepfun-s3appbucket-qa/2019-02-26_App/d996dfaa-1a62-4062-9350-d0c2bd62e867/01/DeltaOutPut/processing/Delta/.hive-staging_hive_2019-02-26_06-15-00_804_541842212852799084-1/_task_tmp.-ext-10000/_tmp.000000_1
2019-02-26 06:28:54,319 [INFO] [TezChild] |exec.FileSinkOperator|: New Final Path: FS s3://205067-pcfp-app-stepfun-s3appbucket-qa/2019-02-26_App/d996dfaa-1a62-4062-9350-d0c2bd62e867/01/DeltaOutPut/processing/Delta/.hive-staging_hive_2019-02-26_06-15-00_804_541842212852799084-1/_tmp.-ext-10000/000000_1
2019-02-26 06:28:54,681 [INFO] [TezChild] |exec.FileSinkOperator|: FS[11]: records written - 1
2019-02-26 06:28:54,877 [INFO] [TezChild] |exec.MapOperator|: MAP[0]: records read - 1000
2019-02-26 06:28:56,632 [INFO] [TezChild] |exec.MapOperator|: MAP[0]: records read - 10000
2019-02-26 06:29:13,301 [INFO] [TezChild] |exec.MapOperator|: MAP[0]: records read - 100000
2019-02-26 06:31:59,207 [INFO] [TezChild] |exec.MapOperator|: MAP[0]: records read - 1000000
2019-02-26 06:34:42,686 [INFO] [TaskHeartbeatThread] |task.TaskReporter|: Received should die response from AM
2019-02-26 06:34:42,686 [INFO] [TaskHeartbeatThread] |task.TaskReporter|: Asked to die via task heartbeat
2019-02-26 06:34:42,687 [INFO] [TaskHeartbeatThread] |task.TezTaskRunner2|: Attempting to abort attempt_1551161362408_0001_7_01_000000_1 due to an invocation of shutdownRequested
2019-02-26 06:34:42,687 [INFO] [TaskHeartbeatThread] |tez.TezProcessor|: Received abort
2019-02-26 06:34:42,687 [INFO] [TaskHeartbeatThread] |tez.TezProcessor|: Forwarding abort to RecordProcessor
2019-02-26 06:34:42,687 [INFO] [TaskHeartbeatThread] |tez.MapRecordProcessor|: Forwarding abort to mapOp: {} MAP
2019-02-26 06:34:42,687 [INFO] [TaskHeartbeatThread] |exec.MapOperator|: Received abort in operator: MAP
2019-02-26 06:34:42,705 [INFO] [TezChild] |s3.S3FSInputStream|: Encountered exception while reading '2019-02-26_App/d996dfaa-1a62-4062-9350-d0c2bd62e867/01/IncrFile/WB.ACTIVE.OCT17_01_OF_10.gz', will retry by attempting to reopen stream.
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.AbortedException:
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.internal.SdkFilterInputStream.abortIfNeeded(SdkFilterInputStream.java:53)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:81)
at com.amazon.ws.emr.hadoop.fs.s3n.InputStreamWithInfo.read(InputStreamWithInfo.java:173)
at com.amazon.ws.emr.hadoop.fs.s3.S3FSInputStream.read(S3FSInputStream.java:136)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at java.io.DataInputStream.read(DataInputStream.java:149)
at org.apache.hadoop.io.compress.DecompressorStream.getCompressedData(DecompressorStream.java:179)
at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:163)
at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:105)
at java.io.InputStream.read(InputStream.java:101)
at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:182)
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:218)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:176)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:255)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:48)
at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:360)
at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:79)
at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:33)
at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:116)
at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.next(TezGroupedSplitsInputFormat.java:151)
at org.apache.tez.mapreduce.lib.MRReaderMapred.next(MRReaderMapred.java:116)
at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:62)
Let me answer my own question .
First thing very important that we have noticed while running HIVE jobs on EMR is that STEP error is misleading vertex failed will not point you in the correct direction .
So better to check for hive logs.
Now if our instance is terminated, then we can not log into master instance and see the logs , in that case we have to look for nodes application logs .
Here is how we can find that nodes logs .
Get the master instance id something like this (i-04d04d9a8f7d28fd1) and with that search for in nodes .
Then open below path
/applications/hive/user/hive/hive.log.gz
Here you can find the expected error .
Also we have to look for the containers logs for the failed nodes ,failed nodes details can be found in master instance node.
hadooplogs/j-25RSD7FFOL5JB/node/i-03f8a646a7ae97aae/daemons/
This daemons nodes logs can be found only if cluster is running else after terminating the cluster EMR does not pushes the logs into S3 log uri .
When i looked at it i got the real reason why it was failing . For me this was the reason of failure
On checking the master instance's instance-controller logs, i saw there were multiple core instances went into un-healthy state :
2019-02-27 07:50:03,905 INFO Poller: InstanceJointStatusMap contains 21 entries (R:21):
i-0131b7a6abd0fb8e7 1541s R 1500s ig-28 ip-10-97-51-145.tr-fr-nonprod.aws-int.thomsonreuters.com I: 18s Y:U 81s c: 0 am: 0 H:R 0.6%Yarn unhealthy Reason : 1/1 local-dirs are bad: /mnt/yarn; 1/1 log-dirs are bad: /var/log/hadoop-yarn/containers
i-01672279d170dafd3 1539s R 1500s ig-28 ip-10-97-54-69.tr-fr-nonprod.aws-int.thomsonreuters.com I: 16s Y:R 79s c: 0 am:241664 H:R 0.7%
i-0227ac0f0932bd0b3 1539s R 1500s ig-28 ip-10-97-51-197.tr-fr-nonprod.aws-int.thomsonreuters.com I: 16s Y:R 79s c: 0 am:241664 H:R 4.1%
i-02355f335c190be40 1544s R 1500s ig-28 ip-10-97-52-150.tr-fr-nonprod.aws-int.thomsonreuters.com I: 22s Y:R 84s c: 0 am:241664 H:R 0.2%
i-024ed22b6affdd5ec 1540s R 1500s ig-28 ip-10-97-55-123.tr-fr-nonprod.aws-int.thomsonreuters.com I: 16s Y:U 79s c: 0 am: 0 H:R 0.6%Yarn unhealthy Reason : 1/1 local-dirs are bad: /mnt/yarn; 1/1 log-dirs are bad: /var/log/hadoop-yarn/containers
Also after some time yarn Blacklisted the Core instances:
2019-02-27 07:46:39,676 INFO Poller: Determining health status for App Monitor: aws157.instancecontroller.apphealth.monitor.YarnMonitor
2019-02-27 07:46:39,688 INFO Poller: SlaveRecord i-0ac26bd7886fec338 changed state from RUNNING to BLACKLISTED
2019-02-27 07:47:13,695 INFO Poller: SlaveRecord i-0131b7a6abd0fb8e7 changed state from RUNNING to BLACKLISTED
2019-02-27 07:47:13,695 INFO Poller: Update SlaveRecordDbRow for i-0131b7a6abd0fb8e7 ip-10-97-51-145.tr-fr-nonprod.aws-int.thomsonreuters.com
2019-02-27 07:47:13,696 INFO Poller: SlaveRecord i-024ed22b6affdd5ec changed state from RUNNING to BLACKLISTED
2019-02-27 07:47:13,696 INFO Poller: Update SlaveRecordDbRow for i-024ed22b6affdd5ec ip-10-97-55-123.tr-fr-nonprod.aws-int.thomsonreuters.com
On checking the instance nodes instance-controller logs, I can see the /mnt got full due to job caching and usage went beyond threshold i.e 90% by default.
Because of this yarn :
2019-02-27 07:40:52,231 INFO dsm-1: /mnt total 27633 MB free 2068 MB used 25565 MB
2019-02-27 07:40:52,231 INFO dsm-1: / total 100663 MB free 97932 MB used 2731 MB
2019-02-27 07:40:52,231 INFO dsm-1: cycle 17 /mnt/var/log freeSpaceMb: 2068/27633 MB freeRatio:0.07
2019-02-27 07:40:52,248 INFO dsm-1: /mnt/var/log stats :
-> As in my dataset, source table is having .gz compression. As .gz compressed files are non-splitable due to this 1 file is having 1 map task assigned to it. And as the map task will decompress the file in /mnt, it may also lead to that issue.
-> Processing a large amount of data in EMR needs some hive properties to be optimized. Below are the few optimization property can be set in the cluster to make the query run in a better way.
V.V.V.V.V.I
Increase the EBS volume size for Core instances
Important is that we have to increase the EBS voulume for each core not alone for the master because EBS volume is where /mnt gets mounted not on the route .
This alone has solved my problem but below configuration also helped me optimize the HIVE jobs
hive-site.xml
-------------
"hive.exec.compress.intermediate" : "true",
"hive.intermediate.compression.codec" : "org.apache.hadoop.io.compress.SnappyCodec",
"hive.intermediate.compression.type" : "BLOCK"
yarn-site.xml
-------------
"max-disk-utilization-per-disk-percentage" : "99"
And this has resolved my issue permanently .
Hope some one will get benefited with my answer