Tags: ibm-cloud, apache-spark-sql, pyspark, jupyter, dashdb

How to execute arbitrary SQL from a PySpark notebook using SQLContext?


I'm trying a basic test case: reading data from dashDB into Spark and then writing it back to dashDB.

Step 1. First, within the notebook, I read the data:

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)
dashdata = sqlContext.read.jdbc(
    url="jdbc:db2://bluemix05.bluforcloud.com:50000/BLUDB:user=****;password=****;",
    table="GOSALES.BRANCH"
).cache()
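
A quick sanity check at this point (not in the original post; these are standard Spark DataFrame calls):

dashdata.printSchema()   # column names and types as mapped from dashDB
print(dashdata.count())  # row count; also materializes the cache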

Step 2. Then, on the dashDB side, I create the target table:

DROP TABLE ****.FROM_SPARK;
CREATE TABLE ****.FROM_SPARK AS (
    SELECT *
    FROM GOSALES.BRANCH
) WITH NO DATA

Step 3. Finally, within the notebook, I save the data back to the table:

from pyspark.sql import DataFrameWriter

writer = DataFrameWriter(dashdata)
# Note: jdbc() performs the write and returns None, so don't reassign dashdata here.
writer.jdbc(
    url="jdbc:db2://bluemix05.bluforcloud.com:50000/BLUDB:user=****;password=****;",
    table="****.FROM_SPARK"
)
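
Equivalently, the writer is reachable through the DataFrame's write property, which also accepts a save mode; a sketch (the mode argument is my addition, not from the original post):

dashdata.write.jdbc(
    url="jdbc:db2://bluemix05.bluforcloud.com:50000/BLUDB:user=****;password=****;",
    table="****.FROM_SPARK",
    mode="append"  # append to the pre-created table; the default mode errors out if the table exists
)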

Question: Is it possible to run the SQL from step 2 directly from PySpark? I couldn't see how to do this in the PySpark documentation. I don't want to use vanilla Python to connect to dashDB because of the effort involved in setting up a driver library.


Solution

  • Use ibmdbpy. See this brief demo.

    With as_idadataframe() you can upload a DataFrame into dashDB as a table; a short sketch follows the steps below.


    Key steps are reproduced here, since Stack Overflow discourages link-only answers:

    Step 1: Add a cell containing:

    !pip install --user future
    !pip install --user lazy
    !pip install --user jaydebeapi
    !pip uninstall --yes ibmdbpy
    !pip install ibmdbpy --user --no-deps
    !wget -O $HOME/.local/lib/python2.7/site-packages/ibmdbpy/db2jcc4.jar https://ibm.box.com/shared/static/lmhzyeslp1rqns04ue8dnhz2x7fb6nkc.zip

    Step 2: Then, in another notebook cell:

    from ibmdbpy import IdaDataBase
    idadb = IdaDataBase('jdbc:db2://<dashdb server name>:50000/BLUDB:user=<dashdb user>;password=<dashdb pw>') 
    ....
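
    To finish the round trip from the question, a minimal sketch (untested; the FROM_SPARK table name is illustrative, and the tablename/clear_existing arguments are assumptions based on the ibmdbpy documentation):

    # Convert the Spark DataFrame to pandas and upload it with ibmdbpy.
    # as_idadataframe() creates the dashDB table and loads the rows in one
    # call, so the manual DROP/CREATE from step 2 becomes unnecessary;
    # clear_existing=True replaces the table if it already exists.
    from_spark = idadb.as_idadataframe(dashdata.toPandas(),
                                       tablename="FROM_SPARK",
                                       clear_existing=True)

    # Arbitrary SQL can be issued over the same connection, e.g. to verify
    # the load (ida_query returns its result as a pandas DataFrame):
    print(idadb.ida_query("SELECT COUNT(*) FROM FROM_SPARK"))

    # Release the JDBC connection when finished.
    idadb.close()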