apache-spark, jdbc, cloudera, cloudera-cdh, apache-spark-2.0

How to distribute JDBC jar on Cloudera cluster?


I've just installed a new Spark 2.4 from a CSD on my CDH cluster (28 nodes) and am trying to install a JDBC driver so I can read data from a database from within a Jupyter notebook. I downloaded the driver and copied it to the /jars folder on one node, but it seems I would have to do the same on each and every host (!). Otherwise I get the following error from one of the workers:

java.lang.ClassNotFoundException: com.microsoft.sqlserver.jdbc.SQLServerDriver

Is there any easy way (without writing bash scripts) to distribute the jar files across the whole cluster? I wish Spark could distribute them itself (or maybe it does and I just don't know how).


Solution

  • Spark has a jdbc format reader you can use.

    Launch a Scala shell to confirm that your MS SQL Server driver is on your classpath.

    example

    Class.forName("com.microsoft.sqlserver.jdbc.SQLServerDriver")

    If the driver class isn't found, make sure you place the jar on an edge node and include it in your classpath when you initialize your session (see the spark-shell example below; a PySpark sketch for Jupyter follows it).

    example

    bin/spark-shell --driver-class-path postgresql-9.4.1207.jar --jars postgresql-9.4.1207.jar
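
    Since you're launching from Jupyter rather than spark-shell, you can instead point the session at the jar when you build it. A minimal PySpark sketch, assuming a hypothetical driver jar path /opt/jdbc/mssql-jdbc-8.4.1.jre8.jar on the edge node:

    from pyspark.sql import SparkSession

    # hypothetical path to the MS SQL Server JDBC driver jar on the edge node
    jar_path = "/opt/jdbc/mssql-jdbc-8.4.1.jre8.jar"

    # spark.jars ships the jar to the executors when the session starts;
    # spark.driver.extraClassPath also puts it on the driver's classpath
    spark = SparkSession.builder \
        .appName("jdbc-example") \
        .config("spark.jars", jar_path) \
        .config("spark.driver.extraClassPath", jar_path) \
        .getOrCreate()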

    Connect to your MS SQL Server via the Spark jdbc reader (a SQL Server-specific adaptation follows the generic examples below)

    example via spark python

    # option1
    jdbcDF = spark.read \
        .format("jdbc") \
        .option("url", "jdbc:postgresql:dbserver") \
        .option("dbtable", "schema.tablename") \
        .option("user", "username") \
        .option("password", "password") \
        .load()
    
    # option2
    jdbcDF2 = spark.read \
        .jdbc("jdbc:postgresql:dbserver", "schema.tablename",
              properties={"user": "username", "password": "password"})
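
    The URLs above come from the PostgreSQL example in the Spark docs; for your MS SQL Server the same pattern applies with a sqlserver-style URL. A hedged sketch (host, port, database, and table names are placeholders):

    # adaptation of option1 for MS SQL Server; connection details are placeholders
    sqlserverDF = spark.read \
        .format("jdbc") \
        .option("url", "jdbc:sqlserver://dbhost:1433;databaseName=mydb") \
        .option("dbtable", "dbo.mytable") \
        .option("user", "username") \
        .option("password", "password") \
        .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
        .load()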
    

    Specifics and additional ways to build connection strings can be found here:

    https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html

    You mentioned Jupyter ... if you still cannot get the above to work, try setting some environment variables as described in this post (I cannot confirm whether it works, though):

    https://medium.com/@thucnc/pyspark-in-jupyter-notebook-working-with-dataframe-jdbc-data-sources-6f3d39300bf6

    At the end of the day, all you really need is the driver jar placed on an edge node (the client where you launch Spark) and appended to your classpath. Then make the connection and parallelize your DataFrame to scale performance, since a JDBC read from an RDBMS runs as a single thread and therefore produces a single partition; see the partitioned-read sketch below.
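
    For completeness, a sketch of a partitioned JDBC read, assuming the table has a numeric column (hypothetically "id" here) whose bounds you know; Spark then issues numPartitions parallel queries instead of one:

    # partitionColumn/lowerBound/upperBound/numPartitions split the read into
    # 16 parallel JDBC queries over the id range (all values are placeholders)
    partitionedDF = spark.read \
        .format("jdbc") \
        .option("url", "jdbc:sqlserver://dbhost:1433;databaseName=mydb") \
        .option("dbtable", "dbo.mytable") \
        .option("user", "username") \
        .option("password", "password") \
        .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
        .option("partitionColumn", "id") \
        .option("lowerBound", "1") \
        .option("upperBound", "1000000") \
        .option("numPartitions", "16") \
        .load()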