Tags: hive, hadoop-yarn, hiveql, hadoop2, amazon-emr

Hive Vertex failed, vertexName=Map 2 while running on EMR on big files


I am running my Hive query on an EMR cluster with 25 nodes, and I have used r4.4xlarge instances to run this.

When I run my query I get the error below.

Job Commit failed with exception 'org.apache.hadoop.hive.ql.metadata.HiveException(java.io.IOException: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Not Found (Service: Amazon S3; Status Code: 404; Error Code: 404 Not Found; Request ID: FEAF40B78D086BEE; S3 Extended Request ID: yteHc4bRl1MrmVhqmnzm06rdzQNN8VcRwd4zqOa+rUY8m2HC2QTt9GoGR/Qu1wuJPILx4mchHRU=), S3 Extended Request ID: yteHc4bRl1MrmVhqmnzm06rdzQNN8VcRwd4zqOa+rUY8m2HC2QTt9GoGR/Qu1wuJPILx4mchHRU=)'
FAILED: Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.tez.TezTask
/mnt/var/lib/hadoop/steps/s-10YQZ5Z5PRUVJ/./hive-script:617:in `<main>': Error executing cmd: /usr/share/aws/emr/scripts/hive-script "--base-path" "s3://us-east-1.elasticmapreduce/libs/hive/" "--hive-versions" "latest" "--run-hive-script" "--args" "-f" "s3://205067-pcfp-app-stepfun-s3appbucket-qa/2019-02-22_App/d77a6a82-26f4-4f06-a1ea-e83677256a55/01/DeltaOutPut/processing/Scripts/script.sql" (RuntimeError)
Command exiting with ret '1'

I have tried setting all kinds of Hive parameter combinations like the ones below (a sketch of how such settings can be applied follows the list).

emrfs-site  fs.s3.consistent.retryPolicyType            exponential
emrfs-site  fs.s3.consistent.metadata.tableName         EmrFSMetadataAlt
emrfs-site  fs.s3.consistent.metadata.write.capacity    300
emrfs-site  fs.s3.consistent.metadata.read.capacity     600
emrfs-site  fs.s3.consistent                            true
hive-site   hive.exec.stagingdir                        .hive-staging
hive-site   hive.tez.java.opts                          -Xmx47364m
hive-site   hive.stats.fetch.column.stats               true
hive-site   hive.stats.fetch.partition.stats            true
hive-site   hive.vectorized.execution.enabled           false
hive-site   hive.vectorized.execution.reduce.enabled    false
hive-site   tez.am.resource.memory.mb                   15000
hive-site   hive.auto.convert.join                      false
hive-site   hive.compute.query.using.stats              true
hive-site   hive.cbo.enable                             true
hive-site   tez.task.resource.memory.mb                 16000
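
For context, these are EMR configuration classifications; a minimal sketch of one way such classification/property pairs can be passed at cluster creation (the file name, release label and the subset of properties shown are illustrative, not my exact setup):

# Sketch: apply configuration classifications at cluster creation.
# The JSON below shows only a subset of the properties listed above;
# release label and instance settings are placeholders.
cat > configurations.json <<'EOF'
[
  {
    "Classification": "emrfs-site",
    "Properties": {
      "fs.s3.consistent": "true",
      "fs.s3.consistent.retryPolicyType": "exponential"
    }
  },
  {
    "Classification": "hive-site",
    "Properties": {
      "hive.vectorized.execution.enabled": "false",
      "hive.auto.convert.join": "false"
    }
  }
]
EOF

aws emr create-cluster \
  --release-label emr-5.20.0 \
  --applications Name=Hive Name=Tez \
  --configurations file://configurations.json \
  --use-default-roles \
  --instance-type r4.4xlarge \
  --instance-count 25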

But every time it failed. I tried increasing the number of nodes and using bigger instances in the EMR cluster, but the result was still the same.

I also tried running with and without Tez, but it still did not work for me.

Here is my sample query. I am just copying part of my query:

insert into filediffPcfp.TableDelta
Select rgt.FILLER1,rgt.DUNSNUMBER,rgt.BUSINESSNAME,rgt.TRADESTYLENAME,rgt.REGISTEREDADDRESSINDICATOR

Please help me identify the issue.

Adding the full YARN logs:

2019-02-26 06:28:54,318 [INFO] [TezChild] |exec.FileSinkOperator|: Final Path: FS s3://205067-pcfp-app-stepfun-s3appbucket-qa/2019-02-26_App/d996dfaa-1a62-4062-9350-d0c2bd62e867/01/DeltaOutPut/processing/Delta/.hive-staging_hive_2019-02-26_06-15-00_804_541842212852799084-1/_tmp.-ext-10000/000000_1
2019-02-26 06:28:54,319 [INFO] [TezChild] |exec.FileSinkOperator|: Writing to temp file: FS s3://205067-pcfp-app-stepfun-s3appbucket-qa/2019-02-26_App/d996dfaa-1a62-4062-9350-d0c2bd62e867/01/DeltaOutPut/processing/Delta/.hive-staging_hive_2019-02-26_06-15-00_804_541842212852799084-1/_task_tmp.-ext-10000/_tmp.000000_1
2019-02-26 06:28:54,319 [INFO] [TezChild] |exec.FileSinkOperator|: New Final Path: FS s3://205067-pcfp-app-stepfun-s3appbucket-qa/2019-02-26_App/d996dfaa-1a62-4062-9350-d0c2bd62e867/01/DeltaOutPut/processing/Delta/.hive-staging_hive_2019-02-26_06-15-00_804_541842212852799084-1/_tmp.-ext-10000/000000_1
2019-02-26 06:28:54,681 [INFO] [TezChild] |exec.FileSinkOperator|: FS[11]: records written - 1
2019-02-26 06:28:54,877 [INFO] [TezChild] |exec.MapOperator|: MAP[0]: records read - 1000
2019-02-26 06:28:56,632 [INFO] [TezChild] |exec.MapOperator|: MAP[0]: records read - 10000
2019-02-26 06:29:13,301 [INFO] [TezChild] |exec.MapOperator|: MAP[0]: records read - 100000
2019-02-26 06:31:59,207 [INFO] [TezChild] |exec.MapOperator|: MAP[0]: records read - 1000000
2019-02-26 06:34:42,686 [INFO] [TaskHeartbeatThread] |task.TaskReporter|: Received should die response from AM
2019-02-26 06:34:42,686 [INFO] [TaskHeartbeatThread] |task.TaskReporter|: Asked to die via task heartbeat
2019-02-26 06:34:42,687 [INFO] [TaskHeartbeatThread] |task.TezTaskRunner2|: Attempting to abort attempt_1551161362408_0001_7_01_000000_1 due to an invocation of shutdownRequested
2019-02-26 06:34:42,687 [INFO] [TaskHeartbeatThread] |tez.TezProcessor|: Received abort
2019-02-26 06:34:42,687 [INFO] [TaskHeartbeatThread] |tez.TezProcessor|: Forwarding abort to RecordProcessor
2019-02-26 06:34:42,687 [INFO] [TaskHeartbeatThread] |tez.MapRecordProcessor|: Forwarding abort to mapOp: {} MAP
2019-02-26 06:34:42,687 [INFO] [TaskHeartbeatThread] |exec.MapOperator|: Received abort in operator: MAP
2019-02-26 06:34:42,705 [INFO] [TezChild] |s3.S3FSInputStream|: Encountered exception while reading '2019-02-26_App/d996dfaa-1a62-4062-9350-d0c2bd62e867/01/IncrFile/WB.ACTIVE.OCT17_01_OF_10.gz', will retry by attempting to reopen stream.
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.AbortedException: 
    at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.internal.SdkFilterInputStream.abortIfNeeded(SdkFilterInputStream.java:53)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:81)
    at com.amazon.ws.emr.hadoop.fs.s3n.InputStreamWithInfo.read(InputStreamWithInfo.java:173)
    at com.amazon.ws.emr.hadoop.fs.s3.S3FSInputStream.read(S3FSInputStream.java:136)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
    at java.io.DataInputStream.read(DataInputStream.java:149)
    at org.apache.hadoop.io.compress.DecompressorStream.getCompressedData(DecompressorStream.java:179)
    at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:163)
    at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:105)
    at java.io.InputStream.read(InputStream.java:101)
    at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:182)
    at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:218)
    at org.apache.hadoop.util.LineReader.readLine(LineReader.java:176)
    at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:255)
    at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:48)
    at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:360)
    at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:79)
    at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:33)
    at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:116)
    at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.next(TezGroupedSplitsInputFormat.java:151)
    at org.apache.tez.mapreduce.lib.MRReaderMapred.next(MRReaderMapred.java:116)
    at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:62)

Solution

  • Let me answer my own question.

    The first and very important thing we have noticed while running Hive jobs on EMR is that the step error is misleading: the "vertex failed" message will not point you in the right direction.

    So it is better to check the Hive logs.

    Now, if the cluster instance has terminated, we cannot log into the master instance to see the logs; in that case we have to look at the nodes' application logs.

    Here is how we can find those node logs.

    Get the master instance ID (something like i-04d04d9a8f7d28fd1) and search for it under the node folders of the logs.

    Then open the path below:

    /applications/hive/user/hive/hive.log.gz
    

    Here you can find the actual error.
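
    Once the logs have been pushed to the cluster's S3 log URI, pulling and reading that file from a workstation looks roughly like this (the log bucket is a placeholder; the cluster ID and instance ID are the examples used above):

      # Placeholders: replace the log bucket, cluster ID (j-...) and instance ID (i-...)
      # with the values from your own cluster.
      LOG_URI=s3://my-emr-log-bucket/logs
      CLUSTER_ID=j-25RSD7FFOL5JB
      MASTER_ID=i-04d04d9a8f7d28fd1

      # Download and read the Hive log from that node's application logs
      aws s3 cp "$LOG_URI/$CLUSTER_ID/node/$MASTER_ID/applications/hive/user/hive/hive.log.gz" .
      zless hive.log.gz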

    We also have to look at the container logs for the failed nodes; the failed node details can be found on the master instance.

    hadooplogs/j-25RSD7FFOL5JB/node/i-03f8a646a7ae97aae/daemons/
    

    These daemon logs for the nodes can be found only while the cluster is running; after the cluster terminates, EMR does not push them to the S3 log URI.
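
    While the cluster is up, you can list that prefix from a workstation; for example, treating "hadooplogs" in the path above as the S3 log bucket/prefix (adjust to your own log location):

      # Bucket/prefix taken from the path above; replace with your own log URI
      aws s3 ls s3://hadooplogs/j-25RSD7FFOL5JB/node/i-03f8a646a7ae97aae/daemons/ --recursive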

    When I looked at those logs, I found the real reason why it was failing. For me, this was the cause of the failure.

    On checking the master instance's instance-controller logs, I saw that multiple core instances had gone into an unhealthy state:

      2019-02-27 07:50:03,905 INFO Poller: InstanceJointStatusMap contains 21 entries (R:21):
      i-0131b7a6abd0fb8e7  1541s R   1500s ig-28 ip-10-97-51-145.tr-fr-nonprod.aws-int.thomsonreuters.com  I:   18s Y:U    81s c: 0 am:    0 H:R  0.6%Yarn unhealthy Reason : 1/1 local-dirs are bad: /mnt/yarn; 1/1 log-dirs are bad: /var/log/hadoop-yarn/containers
      i-01672279d170dafd3  1539s R   1500s ig-28 ip-10-97-54-69.tr-fr-nonprod.aws-int.thomsonreuters.com  I:   16s Y:R    79s c: 0 am:241664 H:R  0.7%
      i-0227ac0f0932bd0b3  1539s R   1500s ig-28 ip-10-97-51-197.tr-fr-nonprod.aws-int.thomsonreuters.com  I:   16s Y:R    79s c: 0 am:241664 H:R  4.1%
      i-02355f335c190be40  1544s R   1500s ig-28 ip-10-97-52-150.tr-fr-nonprod.aws-int.thomsonreuters.com  I:   22s Y:R    84s c: 0 am:241664 H:R  0.2%
      i-024ed22b6affdd5ec  1540s R   1500s ig-28 ip-10-97-55-123.tr-fr-nonprod.aws-int.thomsonreuters.com  I:   16s Y:U    79s c: 0 am:    0 H:R  0.6%Yarn unhealthy Reason : 1/1 local-dirs are bad: /mnt/yarn; 1/1 log-dirs are bad: /var/log/hadoop-yarn/containers
    

    Also, after some time YARN blacklisted the core instances:

      2019-02-27 07:46:39,676 INFO Poller: Determining health status for App Monitor: aws157.instancecontroller.apphealth.monitor.YarnMonitor
      2019-02-27 07:46:39,688 INFO Poller: SlaveRecord i-0ac26bd7886fec338 changed state from RUNNING to BLACKLISTED
      2019-02-27 07:47:13,695 INFO Poller: SlaveRecord i-0131b7a6abd0fb8e7 changed state from RUNNING to BLACKLISTED
      2019-02-27 07:47:13,695 INFO Poller: Update SlaveRecordDbRow for i-0131b7a6abd0fb8e7 ip-10-97-51-145.tr-fr-nonprod.aws-int.thomsonreuters.com 
      2019-02-27 07:47:13,696 INFO Poller: SlaveRecord i-024ed22b6affdd5ec changed state from RUNNING to BLACKLISTED
      2019-02-27 07:47:13,696 INFO Poller: Update SlaveRecordDbRow for i-024ed22b6affdd5ec ip-10-97-55-123.tr-fr-nonprod.aws-int.thomsonreuters.com 
    

    On checking the instance-controller logs on the core nodes, I could see that /mnt had filled up due to job caching, and usage went beyond the threshold (90% by default).
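
    To confirm this on a running cluster, you can also SSH into one of the core nodes and check the mount directly; a minimal sketch (the key file is a placeholder, the host name is the one from the logs above):

      # Placeholders: your EC2 key pair and the core node's address
      ssh -i my-key.pem hadoop@ip-10-97-51-145.tr-fr-nonprod.aws-int.thomsonreuters.com

      # On the node: free space on /mnt and the biggest consumers under it
      df -h /mnt
      sudo du -sh /mnt/yarn /mnt/var/log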

    Because of this, YARN marked the local dirs as bad; the disk monitor log shows how full /mnt was:

      2019-02-27 07:40:52,231 INFO dsm-1: /mnt   total   27633 MB free    2068 MB used  25565 MB
      2019-02-27 07:40:52,231 INFO dsm-1: /      total  100663 MB free   97932 MB used   2731 MB
      2019-02-27 07:40:52,231 INFO dsm-1: cycle 17 /mnt/var/log freeSpaceMb: 2068/27633 MB freeRatio:0.07
      2019-02-27 07:40:52,248 INFO dsm-1: /mnt/var/log stats : 
    

    -> As my source table uses .gz compression, and .gz files are not splittable, each file gets exactly one map task assigned to it. And since the map task decompresses the file under /mnt, that can also contribute to the issue (one possible mitigation is sketched below).
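
    If you control how the input files land in S3, splitting each large .gz into several smaller gzip files before upload lets more map tasks run in parallel; a rough sketch (file names, line count, and the target bucket are illustrative):

      # Illustrative only: split one big gzip file into several smaller gzip parts
      # so that more map tasks can run in parallel (one per part instead of one per file).
      zcat WB.ACTIVE.OCT17_01_OF_10.gz | split -l 5000000 - part_
      for f in part_*; do gzip "$f"; done

      # Upload the parts in place of the single large file (bucket/prefix are placeholders)
      aws s3 cp . s3://my-input-bucket/IncrFile/ --recursive --exclude "*" --include "part_*.gz"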

    -> Processing a large amount of data on EMR needs some Hive properties to be tuned. Below are a few optimization properties that can be set on the cluster to make the query run better.

    Most important:

    Increase the EBS volume size for Core instances
    

    The important point is that we have to increase the EBS volume for every core node, not just for the master, because the EBS volume is where /mnt gets mounted, not the root volume.
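
    For reference, larger EBS volumes for the core instance group can be requested when the cluster is created; a minimal sketch (volume size, type and instance counts are placeholders, not recommendations):

      # Sketch: request larger EBS volumes for the CORE instance group at cluster
      # creation. Volume size/type and instance counts below are placeholders.
      aws emr create-cluster \
        --release-label emr-5.20.0 --applications Name=Hive Name=Tez --use-default-roles \
        --instance-groups '[
          {"Name":"Master","InstanceGroupType":"MASTER","InstanceType":"r4.4xlarge","InstanceCount":1},
          {"Name":"Core","InstanceGroupType":"CORE","InstanceType":"r4.4xlarge","InstanceCount":24,
           "EbsConfiguration":{"EbsBlockDeviceConfigs":[
             {"VolumeSpecification":{"VolumeType":"gp2","SizeInGB":500},"VolumesPerInstance":1}]}}]'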

    This alone solved my problem, but the configuration below also helped me optimize the Hive jobs.

      hive-site.xml
      -------------
      "hive.exec.compress.intermediate" : "true",
      "hive.intermediate.compression.codec" : "org.apache.hadoop.io.compress.SnappyCodec",
      "hive.intermediate.compression.type" : "BLOCK"

      yarn-site.xml
      -------------
      "max-disk-utilization-per-disk-percentage" : "99"
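
    If you prefer not to reconfigure the whole cluster, the hive-site values can also be passed per job on the command line (the script name is a placeholder); the yarn-site disk-utilization setting, however, has to be applied at the cluster/node level:

      # Per-job alternative for the hive-site settings above (script name is a placeholder)
      hive --hiveconf hive.exec.compress.intermediate=true \
           --hiveconf hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec \
           --hiveconf hive.intermediate.compression.type=BLOCK \
           -f script.sql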

    And this resolved my issue permanently.

    I hope someone will benefit from my answer.