Currently I'm trying to integrate AWS EMR with Talend.
My goal is to run a Talend job (exported from Talend Studio) on AWS EMR. I already tried adding it as a "Custom JAR" step, but the exported job apparently also needs its exported lib folder and launcher script to run.
I would like to run it as a fat JAR, but this question suggests that is not possible because Talend has no feature for exporting a job as a fat JAR: "how to export talend job as single fat jar".
Is there a good way to integrate a Talend job with Amazon EMR?
In the end, I solved this problem by using the script-runner.jar provided by AWS.
I created a Lambda function that starts the EMR cluster and appends a step with a HadoopJarStep. This lets the step run a shell script that downloads the exported Talend job and kicks off its launcher script.
See the Boto3 EMR docs for the meaning of each field:
'HadoopJarStep': {
    'Jar': 's3://ap-northeast-1.elasticmapreduce/libs/script-runner/script-runner.jar',
    'Args': [
        's3://your/bucket/name.../talend_run.sh'
    ]
}
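For context, the fragment above is only one field of a step entry. A minimal sketch of building the whole entry in Python might look like the following. The helper name `build_talend_step`, the step name, the `ActionOnFailure` value, and the example job arguments are assumptions for illustration, not part of my actual Lambda code. Any arguments listed after the script path are handed to the script by script-runner.jar.

```python
def build_talend_step(script_s3_uri, job_args=(), region="ap-northeast-1"):
    """Return one entry for the Steps list of an EMR job flow.

    script_s3_uri: S3 location of the shell script that script-runner.jar runs.
    job_args: extra arguments that script-runner.jar passes on to the script.
    """
    return {
        "Name": "Run Talend job",  # hypothetical step name
        "ActionOnFailure": "TERMINATE_CLUSTER",  # assumption: stop the cluster if the job fails
        "HadoopJarStep": {
            # script-runner.jar lives in a region-specific AWS bucket
            "Jar": f"s3://{region}.elasticmapreduce/libs/script-runner/script-runner.jar",
            # first argument is the script, the rest are forwarded to it
            "Args": [script_s3_uri, *job_args],
        },
    }


# Example job arguments are hypothetical; the script path is the placeholder used above.
step = build_talend_step(
    "s3://your/bucket/name.../talend_run.sh",
    ["--context_param", "env=prod"],
)
```

The resulting dictionary can be placed in the `Steps` list passed to the Boto3 EMR client's `run_job_flow`.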
talend_run.sh looks like this (a rather crude shell script):
#!/bin/bash
echo "Start, Talend!!"
# the exported Talend job (the zip contains the bootstrap shell script)
ZIP_NAME=talend-batch-zipped.zip
DIR_NAME=talend-batch-zipped/kicker
SHELL_NAME=kicker_run.sh
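# update packages and install unzip, because the EMR servers don't have the unzip command
sudo yum update -y
sudo yum install -y wget unzip
# remove any previous download, then fetch the zipped job into the current directory
rm -rf "${ZIP_NAME}"
wget "https://your.using.s3.host.name/${ZIP_NAME}" -P "$(pwd)/"
# unzip the zipped job file on the EMR server
unzip "${ZIP_NAME}"
# stop here if the expected job directory is missing
cd "${DIR_NAME}" || exit 1
# pass the step arguments on to the Talend job
bash "./${SHELL_NAME}" "$@"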
After I invoke the AWS Lambda function, the EMR cluster is created, and then the step (the shell script above) is executed on the EMR server.
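To make that flow concrete, here is a rough sketch of what such a Lambda handler could look like with Boto3. It is not my exact function: the cluster name, release label, instance type, and IAM role names are assumptions you would replace with your own values, and `build_job_flow_request` is a hypothetical helper. Only the `HadoopJarStep` part is taken from the setup described above.

```python
# The step that runs talend_run.sh through script-runner.jar.
TALEND_STEP = {
    "Name": "Run Talend job",  # hypothetical step name
    "ActionOnFailure": "TERMINATE_CLUSTER",  # assumption
    "HadoopJarStep": {
        "Jar": "s3://ap-northeast-1.elasticmapreduce/libs/script-runner/script-runner.jar",
        "Args": ["s3://your/bucket/name.../talend_run.sh"],
    },
}


def build_job_flow_request(step):
    """Build keyword arguments for the EMR client's run_job_flow call."""
    return {
        "Name": "talend-batch",  # hypothetical cluster name
        "ReleaseLabel": "emr-5.36.0",  # assumption: pick the release you need
        "Instances": {
            "MasterInstanceType": "m5.xlarge",  # assumption
            "InstanceCount": 1,  # single-node cluster, enough to run a script
            # shut the cluster down once the step has finished
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [step],
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumption: default EMR roles exist
        "ServiceRole": "EMR_DefaultRole",
    }


def lambda_handler(event, context):
    # Imported inside the handler so the request builder can be checked without boto3.
    import boto3

    emr = boto3.client("emr")
    response = emr.run_job_flow(**build_job_flow_request(TALEND_STEP))
    # run_job_flow returns the ID of the newly created cluster
    return {"JobFlowId": response["JobFlowId"]}
```

With `KeepJobFlowAliveWhenNoSteps` set to `False`, the cluster terminates by itself after the Talend step completes, so each Lambda invocation gives you one short-lived cluster per batch run.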