apache-spark, pyspark, jupyter-notebook, amazon-emr

How can I connect AWS EMR Notebooks to Oracle Database?


In a CLI, I moved to the hadoop directory (in EMR) and downloaded the ojdbc6.jar file. Then I tried to connect to the Oracle DB from the PySpark shell using the commands below:

pyspark \
--jars "/home/hadoop/ojdbc6.jar" \
--master yarn-client \
--num-executors 5 \
--driver-memory 14g \
--executor-memory 14g

df = spark.read \
          .format("jdbc") \
          .option("url", "jdbc:oracle:thin:USER/HOST@//IP:PORT/SERVICE") \
          .option("dbtable", "TABLE") \
          .option("user", "USER") \
          .option("password", "PASSWORD") \
          .option("driver", "oracle.jdbc.driver.OracleDriver") \
          .load()

This works, but using a terminal is inconvenient for me, so I want to connect EMR Notebooks to the Oracle DB. I tried the code below:

from pyspark import SparkContext, SparkConf

spark_conf = SparkConf().setAll([('spark.pyspark.python', 'python3'), 
                                 ('spark.pyspark.virtualenv.enabled', 'true'), 
                                 ('spark.pyspark.virtualenv.type', 'native'), 
                                 ('spark.pyspark.virtualenv.bin.path', '/usr/bin/virtualenv'), 
                                 ('spark.driver.extraClassPath', '/home/hadoop/ojdbc6.jar')])\
                        .setAppName('SparkJob')
sc = SparkContext.getOrCreate(conf=spark_conf)

df = sqlContext.read \
               .format("jdbc") \
               .options(url="jdbc:oracle:thin:@//IP:PORT/SERVICE", 
                        dbtable="TABLE", 
                        user="USER", 
                        password="PASSWORD", 
                        driver="oracle.jdbc.driver.OracleDriver") \
               .load()

And I got the following error:

An error occurred while calling o97.load.
: java.lang.ClassNotFoundException: oracle.jdbc.driver.OracleDriver

Is there a step I am missing in order to connect to the Oracle DB from EMR Jupyter Notebooks?


Solution

  • First upload the JDBC driver to an S3 bucket and copy its S3 path, then specify the jar in the first cell of the notebook (it must be the first cell executed). For example, I did this for the MS SQL JDBC driver (you need the Oracle driver here).

    %%configure -f
    {
        "conf": {
            "spark.jars": "s3://jar-test/mssql-jdbc-8.4.0.jre8.jar"        
        }
    }
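
    For the Oracle case in the question, the same cell would point at the ojdbc jar instead (the bucket and key below are placeholders for wherever you uploaded it):

    %%configure -f
    {
        "conf": {
            "spark.jars": "s3://<your-bucket>/ojdbc6.jar"
        }
    }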
    

    Also, here is a snippet to read from JDBC.

    from pyspark.sql import SparkSession

    # Get (or create) the session for this notebook
    spark_session = SparkSession \
        .builder \
        .appName("test") \
        .getOrCreate()

    # Read the table over JDBC
    df = spark_session.read.format("jdbc") \
        .option("url", "jdbc:sqlserver://<public-dns>:<port>") \
        .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
        .option("dbtable", "<table-name>") \
        .option("user", "<username>") \
        .option("password", "<password>") \
        .load()
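
    Applied to the Oracle connection from the question, the same read looks like this (assuming the ojdbc jar was loaded via the %%configure cell above; host, port, service, and credentials are placeholders):

    # Oracle thin-driver read; credentials are passed as options
    df = spark_session.read.format("jdbc") \
        .option("url", "jdbc:oracle:thin:@//<host>:<port>/<service>") \
        .option("driver", "oracle.jdbc.driver.OracleDriver") \
        .option("dbtable", "<table-name>") \
        .option("user", "<username>") \
        .option("password", "<password>") \
        .load()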