I have created an EMR cluster thru AWS CLI
aws emr create-cluster --applications Name=Hive Name=HBase Name=Hue Name=Hadoop Name=ZooKeeper
--tags Name="EMR-Atlas" --release-label emr-5.16.0 --ec2-attributes SubnetId=subnet-xxxxx,
KeyName=atlas-emr-dif --use-default-roles --ebs-root-volume-size 100 --instance-groups
InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.xlarge InstanceGroupType=CORE,InstanceCount=1,
InstanceType=m4.xlarge --log-uri s3://xxx/logs/new-log --steps Name="Run Remote Script",
Jar=command-runner.jar,Args=
[bash,-c,
"curl https://s3.amazonaws.com/aws-bigdata-blog/artifacts/aws-blog-emr-atlas/apache-atlas-emr.sh
-o /tmp/script.sh; chmod +x /tmp/script.sh; /tmp/script.sh"]
Then I have established a SSH connection for HUE:
--ssh -L 8888:localhost:8888 -i key.pem hadoop@<EMR Master IP Address>
I have created a Hive table thru HUE :
CREATE external TABLE us_disease
(
YearStart int,
StratificationCategory2 string,
GeoLocation string,
ResponseID string,
LocationID int,
TopicID string
)
row format delimited
fields terminated by ','
LOCATION 's3://XXXX/data/USHealthcare/'
TBLPROPERTIES ("skip.header.line.count"="1");
I am able to fetch records with SELECT statement thru HUE.
But, if I try to execute the select statement thru HQL it fails. I tried in the following way: My HQL is plain SELECT statment
select * from us_disease limit 10;
and I have stored the same in S3 as hive.hql.
I executed the hql thru step in emr cluster:
Log :
INFO redirectError to /mnt/var/log/hadoop/steps/s-xxxxxxxx/stderr
INFO Working dir /mnt/var/lib/hadoop/steps/s-xxxxxxxx
INFO ProcessRunner started child process 30597 :
hadoop 30597 5505 0 11:40 ? 00:00:00 bash /usr/lib/hadoop/bin/hadoop jar /var/lib/aws/emr/step-runner/hadoop-jars/command-runner.jar hive-script --run-hive-script --args -f s3://dif-test/data-governance/hql/hive.hql
2021-03-30T11:40:36.318Z INFO HadoopJarStepRunner.Runner: startRun() called for s-xxxxxxxx Child Pid: 30597
INFO Synchronously wait child process to complete : hadoop jar /var/lib/aws/emr/step-runner/hadoop-...
INFO waitProcessCompletion ended with exit code 127 : hadoop jar /var/lib/aws/emr/step-runner/hadoop-...
INFO total process run time: 2 seconds
2021-03-30T11:40:36.437Z INFO Step created jobs:
2021-03-30T11:40:36.438Z WARN Step failed with exitCode 127 and took 2 seconds
stderr:
/usr/lib/hadoop/bin/hadoop: line 169: /etc/alternatives/jre/bin/java: No such file or directory
Any help appreciated. Thank you.
The issue got fixed after I updated the emr version. Previously I was using emr-5.16.0 . I changed to emr-5.32.0.
Modified code :
aws emr create-cluster --applications Name=Hive Name=HBase Name=Hue Name=Hadoop Name=ZooKeeper --tags Name="EMR-Atlas" --release-label emr-5.32.0 --ec2-attributes SubnetId=subnet-xxxx,KeyName=atlas-emr-dif --use-default-roles --ebs-root-volume-size 100 --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m5.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m5.xlarge --log-uri s3://xxx/xxx/new-log --steps Name="Run Remote Script",Jar=command-runner.jar,Args=[bash,-c,"curl https://s3.amazonaws.com/aws-bigdata-blog/artifacts/aws-blog-emr-atlas/apache-atlas-emr.sh -o /tmp/script.sh; chmod +x /tmp/script.sh; /tmp/script.sh"]