Tags: python, postgresql, ibm-cloud, pyspark, compose-db

Accessing data in a Compose PostgreSQL database from Spark as a Service Python notebook on Bluemix


I have data in a postgres database that I am trying to access through Spark as a Service on IBM Bluemix (using a python notebook). Here is my code:

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

df = sqlContext.load(source="jdbc",\
                 url="jdbc:postgresql://[publichost]:[port]/compose",\
                 dbtable="[tablename]")
df.take(2)

The error I'm getting (during the df = line) is:

Py4JJavaError: An error occurred while calling o42.load.
: java.sql.SQLException: No suitable driver found for jdbc:postgresql://host:port/compose
at java.sql.DriverManager.getConnection(DriverManager.java:700)
at java.sql.DriverManager.getConnection(DriverManager.java:219)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anonfun$getConnector$1.apply(JDBCRDD.scala:188)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anonfun$getConnector$1.apply(JDBCRDD.scala:181)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:121)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:91)
at org.apache.spark.sql.execution.datasources.jdbc.DefaultSource.createRelation(DefaultSource.scala:60)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:158)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:95)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:55)
at java.lang.reflect.Method.invoke(Method.java:507)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:785)

Can I install or update this driver? Any advice or a working example would be much appreciated!


Solution

  • This happens because the PostgreSQL JDBC driver is not installed by default in your Spark service instance.

    You need to add it before you can use it.

    Switch the kernel to Scala from the notebook menu and execute the statement below. You only need to do this once per Spark instance; after that the Postgres driver is available regardless of kernel type (Python, Scala, or R), and you can simply use it.

    %AddJar -f https://jdbc.postgresql.org/download/postgresql-9.4.1207.jre7.jar
    Starting download from https://jdbc.postgresql.org/download/postgresql-9.4.1207.jre7.jar
    Finished download of postgresql-9.4.1207.jre7.jar

    Now switch the kernel back to Python and run:
    from pyspark.sql import SQLContext
    sqlContext = SQLContext(sc)
    # Ignore any connection error that results from the placeholder connection
    # details; simply replace [publichost], [port], databasename and [tablename]
    # with your own values.
    df = sqlContext.load(source="jdbc",\
                     url="jdbc:postgresql://[publichost]:[port]/databasename",\
                     dbtable="[tablename]")
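    On newer Spark releases, `sqlContext.load(source="jdbc", ...)` is deprecated in favour of the `DataFrameReader` API. As a sketch (the `jdbc_options` helper name and all connection values below are placeholders, not part of the original notebook), the same read could be written as:

    ```python
    # Sketch: build the JDBC options once, then reuse them with the
    # DataFrameReader API (Spark 1.4+), which replaces sqlContext.load.
    def jdbc_options(host, port, database, table):
        """Return the option dict for a Compose PostgreSQL JDBC read.

        host/port/database/table are placeholders for your Compose
        deployment's connection details.
        """
        return {
            "url": "jdbc:postgresql://{0}:{1}/{2}".format(host, port, database),
            "dbtable": table,
            # Driver class provided by the jar added via %AddJar above.
            "driver": "org.postgresql.Driver",
        }

    # In the notebook (requires a SparkContext `sc`):
    # from pyspark.sql import SQLContext
    # sqlContext = SQLContext(sc)
    # df = sqlContext.read.format("jdbc").options(
    #          **jdbc_options("publichost", 5432, "compose", "mytable")).load()
    # df.take(2)
    ```

    Passing the `driver` option explicitly can also help Spark locate the class when the plain `DriverManager` lookup fails.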
    

    The complete importable notebook is available here: https://github.com/charles2588/bluemixsparknotebooks/raw/master/Python/python_postgres.ipynb

    Thanks, Charles.