Search code examples
ibm-cloudwatson-studioanalytics-engine

How to read and write to hive tables from Apache Spark on Analytics Engine


I would like to use Spark on analytics engine from a Jupyter notebook in Watson Studio to be able to read and write to Hive tables, but it isn't clear how I can do this by just reading the Spark documentation. The reason that it isn't clear is that IBM Analytics Engine comes pre-configured with Spark and does not provide root access.

I have found some posts online for generic hadoop that describe creating a hive.xml with the metastore location, but it isn't clear how this is translated to the IBM Analytics Engine environment. For example: Unable to write data on hive using spark


Solution

  • Here is an example using scala ...

    Skip the first two steps if you are running spark directly on the Analytics Engine cluster.

    1. Create a Watson Studio Project and associate with an Analytics Engine service.

    2. Create a scala Notebook that uses the Analytics Engine Spark service

    3. Enter the following scala code to retrieve the HiveContext and list the Hive databases (if any)

        import org.apache.spark.sql.hive.HiveContext
        val hc = new HiveContext(sc)
    
        // uncomment and adjust the next line if you are using
        // Compose mysql for the hive metastore
    
        /*
           hc.setConf("hive.metastore.warehouse.dir", 
             "mysql://admin:[email protected]:32023/compose");
        */
    
        import hc.implicits._
        val df = hc.sql("show databases")
        df.show
    
    1. Create a Dataframe from static values for testing
        val test_df = Seq(
          (8, "bat"),
          (64, "mouse"),
          (-27, "horse")
        ).toDF("number", "word");
    
    1. Write the dataframe
        test_df.write.mode("overwrite").saveAsTable("src");
    
    1. Now verify that you are able to read the table from spark
        val read_df = hc.sql("select * from src")
        read_df.show
    
    1. Open a hive session and verify you can query the table from hive
        select * from src