apache-spark, jdbc, cloudera, cloudera-cdh, apache-spark-2.0

How to distribute JDBC jar on Cloudera cluster?


I've just installed a new Spark 2.4 from a CSD on my CDH cluster (28 nodes) and am trying to install a JDBC driver so I can read data from a database from within a Jupyter notebook. I downloaded the driver and copied it to the /jars folder on one node, but it seems I would have to do the same on each and every host (!). Otherwise I get the following error from one of the workers:

java.lang.ClassNotFoundException: com.microsoft.sqlserver.jdbc.SQLServerDriver

Is there any easy way (without writing bash scripts) to distribute the jar files across the whole cluster? I wish Spark could distribute them itself (or maybe it does and I just don't know how).


Solution

  • Spark has a jdbc format reader you can use.

    Launch a Scala shell to confirm that your MS SQL Server driver is on your classpath.

    example

    Class.forName("com.microsoft.sqlserver.jdbc.SQLServerDriver")

    If the driver class isn't found, make sure you place the jar on an edge node and include it in your classpath when you initialize your session (see the spark-shell example below; a PySpark sketch for Jupyter follows it).

    example

    bin/spark-shell --driver-class-path postgresql-9.4.1207.jar --jars postgresql-9.4.1207.jar
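
    Since you're launching from Jupyter rather than spark-shell, you can instead point the session at the jar when you build it. A minimal PySpark sketch, assuming a hypothetical driver jar path /opt/jdbc/mssql-jdbc-8.4.1.jre8.jar on the edge node:

    from pyspark.sql import SparkSession

    # hypothetical path to the MS SQL Server JDBC driver jar on the edge node
    jar_path = "/opt/jdbc/mssql-jdbc-8.4.1.jre8.jar"

    # spark.jars ships the jar to the executors when the session starts;
    # spark.driver.extraClassPath also puts it on the driver's classpath
    spark = SparkSession.builder \
        .appName("jdbc-example") \
        .config("spark.jars", jar_path) \
        .config("spark.driver.extraClassPath", jar_path) \
        .getOrCreate()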

    Connect to your MS SQL Server via the Spark jdbc reader (a SQL Server-specific adaptation follows the generic examples below)

    example via spark python

    # option1
    jdbcDF = spark.read \
        .format("jdbc") \
        .option("url", "jdbc:postgresql:dbserver") \
        .option("dbtable", "schema.tablename") \
        .option("user", "username") \
        .option("password", "password") \
        .load()
    
    # option2
    jdbcDF2 = spark.read \
        .jdbc("jdbc:postgresql:dbserver", "schema.tablename",
              properties={"user": "username", "password": "password"})
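
    The URLs above come from the PostgreSQL example in the Spark docs; for your MS SQL Server the same pattern applies with a sqlserver-style URL. A hedged sketch (host, port, database, and table names are placeholders):

    # adaptation of option1 for MS SQL Server; connection details are placeholders
    sqlserverDF = spark.read \
        .format("jdbc") \
        .option("url", "jdbc:sqlserver://dbhost:1433;databaseName=mydb") \
        .option("dbtable", "dbo.mytable") \
        .option("user", "username") \
        .option("password", "password") \
        .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
        .load()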
    

    Specifics and additional ways to build connection strings can be found here:

    https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html

    You mentioned Jupyter ... if you still cannot get the above to work, try setting some environment variables as described in this post (I cannot confirm whether it works, though):

    https://medium.com/@thucnc/pyspark-in-jupyter-notebook-working-with-dataframe-jdbc-data-sources-6f3d39300bf6

    At the end of the day, all you really need is the driver jar placed on an edge node (the client where you launch Spark) and appended to your classpath. Then make the connection and parallelize your DataFrame to scale performance, since a JDBC read from an RDBMS runs as a single thread and therefore produces a single partition; see the partitioned-read sketch below.
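
    For completeness, a sketch of a partitioned JDBC read, assuming the table has a numeric column (hypothetically "id" here) whose bounds you know; Spark then issues numPartitions parallel queries instead of one:

    # partitionColumn/lowerBound/upperBound/numPartitions split the read into
    # 16 parallel JDBC queries over the id range (all values are placeholders)
    partitionedDF = spark.read \
        .format("jdbc") \
        .option("url", "jdbc:sqlserver://dbhost:1433;databaseName=mydb") \
        .option("dbtable", "dbo.mytable") \
        .option("user", "username") \
        .option("password", "password") \
        .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
        .option("partitionColumn", "id") \
        .option("lowerBound", "1") \
        .option("upperBound", "1000000") \
        .option("numPartitions", "16") \
        .load()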