Search code examples
javascalaapache-sparkscalarapache-zeppelin

Apache-Zeppelin / Spark : Why can't I access a remote DB with this code sample


I am doing my first own steps with Spark and Zeppelin and don't understand why this code sample isn't working.

First Block:

%dep
z.reset()                                                     // clean up 
z.load("/data/extraJarFiles/postgresql-9.4.1208.jar")         // load a jdbc driver for postgresql    

Second Block

%spark
// This code loads some data from a PostGreSql DB with the help of a JDBC driver.
// The JDBC driver is stored on the Zeppelin server, the necessary Code is transfered to the Spark Workers and the workers build the connection with the DB.
// 
// The connection between table and data source is "lazy".  So the data will only be loaded in the case that an action need them.
// With the current script means this the DB is queried twice.   ==> Q:  How can I keep a RDD in Mem or on disk?

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.rdd.JdbcRDD
import java.sql.Connection
import java.sql.DriverManager
import java.sql.ResultSet

import org.apache.spark.sql.hive._ 
import org.apache.spark.sql._

val url = "jdbc:postgresql://10.222.22.222:5432/myDatabase"
val username = "postgres"
val pw = "geheim"

Class.forName("org.postgresql.Driver").newInstance                             // activating the jdbc driver. The jar file was loaded inside of the %dep block


case class RowClass(Id:Integer, Col1:String , Col2:String)                         // create a class with possible values

val myRDD = new JdbcRDD(sc,                                                    // SparkContext sc
                        () => DriverManager.getConnection(url,username,pw),    // scala.Function0<java.sql.Connection> getConnection
                        "select * from tab1 where \"Id\">=? and \"Id\" <=? ",  // String sql    Important: we need here two '?' for the lower/upper Bounds vlaues
                        0,                                                     // long lowerBound   = start value
                        10000,                                                // long upperBound,  = end value that is still included
                        1,                                                     // int numPartitions  = the area is spitted into x sub commands.   
                                                                               //  e.g. 0,1000,2  => first cmd from 0 ... 499, second cmd from 500..1000
                        row => RowClass(row.getInt("Id"),
                                        row.getString("Col1"),
                                        row.getString("Col2"))
                       )

myRDD.toDF().registerTempTable("Tab1")

// --- improved methode (not working at the moment)----
val prop = new java.util.Properties
prop.setProperty("user",username)
prop.setProperty("password",pw)

val tab1b = sqlContext.read.jdbc(url,"tab1",prop)             // <-- not working

tab1b.show

So what is the problem.

I want to connect to an external PostgreSql DB.

Block I is adding the necessary JAR file for the DB and first lines of the second block is already using the JAR and it is able get some data out of the DB.

But the first way is ugly, because you have to convert the data by your own into a table, so I want to use the easier method at the end of the script.

But I am getting the error message

java.sql.SQLException: No suitable driver found for jdbc:postgresql://10.222.22.222:5432/myDatabase

But it is the same URL / same login / same PW from the above code. Why is this not working?

Maybe somebody has a helpful hint for me.

---- Update: 24.3. 12:15 ---

I don't think the loading of the JAR is not working. I added an extra val db = DriverManager.getConnection(url, username, pw); for testing. (The function that fails inside of the Exception) And this works well.

Another interesting detail. If I remove the %dep block and class line, produces the first block a very similar error. Same Error Message; same function + line number that is failing, but the stack of functions is a bit different.

I have found the source code here: http://code.metager.de/source/xref/openjdk/jdk8/jdk/src/share/classes/java/sql/DriverManager.java

My problem is in line 689. So if all parameters are OK , maybe it comes from the isDriverAllowed() check ?


Solution

  • I ve had the same problem with dependencies in Zeppelin, and I had to add my jars to the SPARK_SUBMIT_OPTIONS in zeepelin-env.sh to have them included in all notebooks and paragraphs

    SO in zeppelin-env.sh you modify SPARK_SUBMIT_OPTIONS to be:

    export SPARK_SUBMIT_OPTIONS="--jars /data/extraJarFiles/postgresql-9.4.1208.jar

    Then you have to restart your zeppelin instance.