Search code examples
jdbcpysparkjupyter-notebookbeelinespark-thriftserver

How can I make my pyspark database visible to beeline on windows?


Hey All, I need some help to open a pickle jar in pyspark. I'm learning pyspark and as an exercise I'm trying to do something that I assumed would be a simple thing... but its giving me a very hard time.

So let's start with what I'm doing. I'm working with windows and I'm trying to start a local thrift server and connect via beeline to a database I created within my app in a spark session.

::step1 - start master in cmd
start %SPARK_HOME%\bin\spark-class.cmd org.apache.spark.deploy.master.Master

Step 1 works great and I can see the master server spark://n.n.n.n:7077 created with my spark 2.3.2. in localhost:8080.

::step2 - start worker in cmd
start %SPARK_HOME%\bin\spark-class.cmd org.apache.spark.deploy.worker.Worker spark://n.n.n.n:7077

So far so good and I can see I have a worker alive with 8 cores running.

::step3 - start thrift server
start %SPARK_HOME%\bin\spark-submit.cmd --verbose --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 spark-internal --hiveconf hive.server2.thrift.port=10000 --master spark://n.n.n.n:7077 --total-executor-cores 4

Step 3 ok, I can see the SparkSQL application running and I can even connect to it via beeline!

::if I do
start %SPARK_HOME%\bin\beeline.cmd

Beeline version 1.2.1.spark2 by Apache Hive
beeline> !connect jdbc:hive2://localhost:10000
Connecting to jdbc:hive2://localhost:10000
Enter username for jdbc:hive2://localhost:10000: username
Enter password for jdbc:hive2://localhost:10000: password

Connected to: Spark SQL (version 2.3.2)
Driver: Hive JDBC (version 1.2.1.spark2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://localhost:10000> SHOW TABLES;
+-----------+------------+--------------+--+
| database  | tableName  | isTemporary  |
+-----------+------------+--------------+--+
+-----------+------------+--------------+--+

Final Step: create my app and feed it to my spark master I now go to jupiter-notebook to run my application with:

spark=SparkSession.builder.appName("Publish a database attempt").master("spark://n.n.n.n:7077").config('spark.sql.warehouse.dir','file:///D:/tmp/SparkDatabase/').getOrCreate()
df1=spark.read.csv('C:/data.csv',inferSchema=True,header=True)
df1.coalesce(1).write.mode('overwrite').saveAsTable('testTable')

spark.sql('show tables in default').show()
+--------+-----------+-----------+
|database|  tableName|isTemporary|
+--------+-----------+-----------+
| default|  testTable|      false|
+--------+-----------+-----------+

spark.sql("SELECT * FROM testTable").show()
+--------+--------+--------+
| column1| column2| column3|
+--------+--------+--------+
|      hi|   there|  friend|
+--------+--------+--------+    

This is when the spark hits the fan... I can see this database in my application session but its not showing with beeline. Going back to beeline and doing another SHOW TABLES; Don't do the trick.

Comments:

  1. I noticed that when I start the thrift server it creates a metastore_db and this doesn't happen when I create my application from jupiter notebook.
  2. If I try to create my application from a pyspark instance instead of jupyter-notebook I get a ERROR XSDB6: Another instance of Derby running.

What I'm trying to achieve here is just to be able to see and query my table via beeline in windows as an exercise. Would really appreciate the help with that pickle jar.

using: conda 4.5.12; Python 3.6.5; PySpark 2.3.2; hadoop-3.0.0


Solution

  • I've found another question in stackoverflow that contains a way to solve my issue and I can consider as an answer: Start HiveThriftServer programmatically in Python

    Essentially I need to forget about step 3 and run my application in Jupyter-notebook with:

    from pyspark import *
    from pyspark.sql import *
    
    spark = SparkSession.builder \
     .appName('Hi there') \
     .master('spark://n.n.n.n:7077') \
     .enableHiveSupport() \
     .config('spark.sql.hive.thriftServer.singleSession', True) \
     .getOrCreate()
    
    #Start the Thrift Server using the jvm and passing the same spark session corresponding to pyspark session in the jvm side.
    spark.sparkContext._gateway.jvm.org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.startWithContext(spark._jwrapped) 
    
    df1=spark.read.csv('C:/data.csv',inferSchema=True,header=True)
    df1.coalesce(1).write.mode('overwrite').saveAsTable('testTable')
    

    That way I will start the Thrift Server using the java for the current session. With that, I can easily connect to all my tables and access the data with beeline or other odbc connector.