Tags: amazon-web-services, jdbc, sqoop, amazon-emr

What is the correct way of installing a JDBC driver on EMR for Sqoop to use?


I am running Sqoop 1.4.7 on AWS EMR 5.21.1 and am trying to import data from a database. I have been able to do this manually on an EMR cluster that I created through the EMR console with Sqoop installed.

These are the preliminary steps I performed in order to run Sqoop on EMR:

  1. Download the JDBC Driver
  2. Move the JDBC driver to the /usr/lib/sqoop/lib directory

I was able to run a Sqoop import successfully after running these commands while SSH'd into the EMR cluster:

wget -O mssql-jdbc.jar https://repo1.maven.org/maven2/com/microsoft/sqlserver/mssql-jdbc/8.4.0.jre8/mssql-jdbc-8.4.0.jre8.jar
sudo mv mssql-jdbc.jar /usr/lib/sqoop/lib/

When I try to run these commands from an EMR bootstrap script, however, I get the error:

usr/lib/sqoop/lib/ No such file or directory

After some investigation I realized this happens because, as the EMR documentation puts it, "Bootstrap actions execute before core services, such as Hadoop or Spark, are installed".

So the /usr/lib/sqoop/lib directory doesn't exist yet when my bootstrap actions run.

Here are some solutions that work, but they feel like workarounds:

  1. Create the /usr/lib/sqoop/lib directory in my bootstrap script and then place the jar in it
  2. Add the jar to this directory as an EMR step. (It turns out this is the correct approach; see the accepted answer below.)
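
For reference, the first workaround could look roughly like the sketch below. It is only an illustration: the script name install-jdbc-driver.sh, the function name install_jdbc_driver, and the two positional arguments are hypothetical, and it assumes that the later Sqoop installation leaves a pre-created lib directory and its contents in place.

```shell
#!/bin/bash
# Hypothetical bootstrap script: install-jdbc-driver.sh <jar-url> <lib-dir>
# Sketch of workaround 1: create the Sqoop lib directory ahead of time and
# place the driver in it. Assumes the later Sqoop install keeps the directory.

install_jdbc_driver() {
  local jar_url="$1" lib_dir="$2"
  local tmp_jar
  tmp_jar="$(mktemp)" || return 1

  # Download first, so a failed download leaves the lib directory untouched.
  wget -q -O "$tmp_jar" "$jar_url" || { rm -f "$tmp_jar"; return 1; }

  # mkdir -p is a no-op when the directory already exists.
  sudo mkdir -p "$lib_dir" || return 1
  sudo mv "$tmp_jar" "$lib_dir/mssql-jdbc.jar" || return 1

  # mktemp creates the file with mode 600; make it readable by other users.
  sudo chmod 644 "$lib_dir/mssql-jdbc.jar"
}

if [ "$#" -eq 2 ]; then
  install_jdbc_driver "$1" "$2"
else
  echo "usage: $0 <jar-url> <lib-dir>" >&2
fi
```

As a bootstrap action it would be given the driver URL from the commands above and /usr/lib/sqoop/lib as its two arguments.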

What is the correct way of installing this JDBC driver on EMR?


Solution

  • The second option is the correct way to do it. The EMR documentation explains how to run bash scripts as an EMR step.

    You can also use command-runner.jar as the step's JAR, with the following arguments:

    bash -c "wget -O mssql-jdbc.jar https://repo1.maven.org/maven2/com/microsoft/sqlserver/mssql-jdbc/8.4.0.jre8/mssql-jdbc-8.4.0.jre8.jar;sudo mv mssql-jdbc.jar /usr/lib/sqoop/lib/"
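
    If you add the step from the AWS CLI rather than the console, one way to avoid quoting problems is to put the step definition in a JSON file. The following is a minimal sketch, not a tested recipe: the file name install-jdbc-step.json, the step name, the CLUSTER_ID variable, and the ActionOnFailure value are assumptions chosen for illustration. It also chains the two commands with && instead of ; so that the move only runs if the download succeeded.

```shell
#!/bin/bash
# Write a step definition that runs the download-and-move command through
# command-runner.jar. File name and step name are illustrative placeholders.
cat > install-jdbc-step.json <<'EOF'
[
  {
    "Type": "CUSTOM_JAR",
    "Name": "Install MSSQL JDBC driver for Sqoop",
    "ActionOnFailure": "CANCEL_AND_WAIT",
    "Jar": "command-runner.jar",
    "Args": [
      "bash",
      "-c",
      "wget -O mssql-jdbc.jar https://repo1.maven.org/maven2/com/microsoft/sqlserver/mssql-jdbc/8.4.0.jre8/mssql-jdbc-8.4.0.jre8.jar && sudo mv mssql-jdbc.jar /usr/lib/sqoop/lib/"
    ]
  }
]
EOF

# Submit the step only when the AWS CLI is available and CLUSTER_ID is set
# (CLUSTER_ID is an assumed environment variable holding your cluster's ID).
if command -v aws >/dev/null 2>&1 && [ -n "${CLUSTER_ID:-}" ]; then
  aws emr add-steps --cluster-id "$CLUSTER_ID" --steps file://install-jdbc-step.json
else
  echo "Wrote install-jdbc-step.json; set CLUSTER_ID and install the AWS CLI to submit it."
fi
```

    Because steps run after the applications are installed, /usr/lib/sqoop/lib already exists by the time this command executes.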