Tags: oracle-database, odbc, database-connection, rodbc, sparklyr

Use sparklyr with an Oracle database connection


I would like some guidance on how to combine the following R packages:

- odbc: used to connect to an existing Oracle data source

- sparklyr: used to process this data on a standalone Spark cluster

Here is what I have done:

- On my client computer, I used the dbConnect() function from the odbc R package to connect to an existing Oracle database. This Oracle database is hosted on a Windows server.

- I separately set up a standalone Spark cluster on computers located on the same local network but isolated from the Windows server. Using this cluster, I would like to call the spark_connect() function from the sparklyr package to connect my client computer (the one connected to the Oracle database) to the Spark cluster. In short, my goal is to use the standalone Spark cluster to run parallel processing (e.g. ml_regression_trees) on the data stored in my Oracle database.
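
To make this concrete, here is roughly what the two connection steps look like on my side (the ODBC driver name, host names, service name, and credentials below are placeholders, not my real values):

    library(DBI)
    library(odbc)
    library(sparklyr)
    
    ## 1) ODBC connection to the Oracle database on the Windows server
    ##    (driver name and credentials are placeholders)
    con <- dbConnect(odbc::odbc(),
                     Driver = "Oracle in OraClient12Home1",
                     DBQ    = "//<windows-server-ip>:1521/<service_name>",
                     UID    = "user",
                     PWD    = "password")
    
    ## 2) connection from the same client to the standalone Spark cluster
    ##    (<master-ip> is the machine running the Spark master, default port 7077;
    ##    a local Spark installation / SPARK_HOME is needed on the client)
    sc <- spark_connect(master = "spark://<master-ip>:7077")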

Does anyone know whether there is a function in sparklyr that does all of this directly? (I mean: connecting to the Oracle database + big data processing with Spark.)

Thank you very much for your help (any advice is welcome!)


Solution

  • Although you asked about an ODBC approach, here is a JDBC solution (probably useful for other readers too, and ODBC is not mentioned in the question title).

    You need ojdbc7.jar somewhere (here it sits in the working directory, but I recommend storing it centrally and providing that path). Change the required values such as spark_home to match your setup. If you are running R on your client computer (and not on an edge node in the cluster), you might use Livy to connect to Spark; a short sketch follows the code block.

    library(sparklyr)
    library(RJDBC)
    
    ##### Spark
    config <- spark_config()
    ### tell config location of oracle jar
    config[["sparklyr.jars.default"]] <- "ojdbc7.jar"
    ### example connection; adapt master, spark_home and version to your cluster
    sc <- spark_connect(master = "yarn-client",
                        spark_home = "/usr/lib/spark",
                        version = "2.2.0",
                        config = config)
    
    datspark <- spark_read_jdbc(sc, "table", options = list(
      url = "jdbc:oracle:thin:@//<ip>:1521/<schema>",
      driver = "oracle.jdbc.OracleDriver",
      user = "user",
      password = "password",
      dbtable = "table"),
      memory = FALSE # don't cache the whole (big) table
      )
    
    ### your R code here
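    ### example only: the question mentions tree-based regression
    ### (ml_regression_trees); one way to express that in sparklyr is
    ### ml_decision_tree with type = "regression". Column names y, x1
    ### and x2 are placeholders for columns of the Oracle table.
    fit <- ml_decision_tree(datspark, y ~ x1 + x2, type = "regression")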
    
    spark_disconnect(sc)
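
    If you go the Livy route mentioned above, the connection step looks roughly like the sketch below (note that ojdbc7.jar then has to be reachable from the cluster side, not just from your client). The host name is a placeholder; 8998 is Livy's default port.

    library(sparklyr)
    
    ### connect through Livy instead of directly to the Spark master
    sc <- spark_connect(master = "http://<livy-host>:8998",
                        method = "livy")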