apache-spark-sql, apache-zeppelin

When registering a table using the %pyspark interpreter in Zeppelin, I can't access the table in %sql


I am using Zeppelin 0.5.5. I couldn't get my own code to work with %pyspark, so I used this Python sample: http://www.makedatauseful.com/python-spark-sql-zeppelin-tutorial/. I suspect his %pyspark example worked because, if you run the original %spark Zeppelin tutorial first, the "bank" table is already created.

This code is in a notebook.

%pyspark
from os import getcwd
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
# sqlContext = SQLContext(sc)  # removed with the latest version I tested
zeppelinHome = getcwd()
bankText = sc.textFile(zeppelinHome + "/data/bank-full.csv")

bankSchema = StructType([StructField("age", IntegerType(), False),
                         StructField("job", StringType(), False),
                         StructField("marital", StringType(), False),
                         StructField("education", StringType(), False),
                         StructField("balance", IntegerType(), False)])

bank = bankText.map(lambda s: s.split(";")) \
    .filter(lambda s: s[0] != "\"age\"") \
    .map(lambda s: (int(s[0]), str(s[1]).replace("\"", ""),
                    str(s[2]).replace("\"", ""), str(s[3]).replace("\"", ""),
                    int(s[5])))

bankdf = sqlContext.createDataFrame(bank, bankSchema)
bankdf.registerAsTable("bank")

This code is in the same notebook but a different paragraph.

%sql 
SELECT count(1) FROM bank

org.apache.spark.sql.AnalysisException: no such table bank; line 1 pos 21
...
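
As a quick diagnostic (a sketch of my own, assuming Zeppelin's standard variable injection rather than anything from the original question), you can print which context variables the %pyspark interpreter actually provides:

%pyspark
# Diagnostic sketch: list which of the usual Zeppelin-injected context
# variables exist in this session. On 0.5.5, "sqlc" should appear but
# "sqlContext" should not.
print([name for name in ("sc", "sqlc", "sqlContext") if name in globals()])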

Solution

  • I found the cause of this issue: prior to 0.6.0, the SQLContext variable in %pyspark is named sqlc, not sqlContext.

    The defect is tracked here: https://issues.apache.org/jira/browse/ZEPPELIN-134

    In Pyspark, the SQLContext is currently available in the variable name sqlc. This is inconsistent with the documentation and with the variable name in Scala, which is sqlContext.

    sqlContext can be used as a variable for the SQLContext, in addition to sqlc (for backward compatibility)

    Related code: https://github.com/apache/incubator-zeppelin/blob/master/spark/src/main/resources/python/zeppelin_pyspark.py#L66

    The suggested workaround is simply to do the following in your %pyspark script (a fuller sketch follows after this answer):

    sqlContext = sqlc

    Found here:

    https://mail-archives.apache.org/mod_mbox/incubator-zeppelin-users/201506.mbox/%3CCALf24sazkTxVd3EpLKTWo7yfE4NvW032j346N+6AuB7KKZS_AQ@mail.gmail.com%3E
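
Putting the workaround together, here is a sketch of the registration paragraph with the alias applied (assuming the bank RDD and bankSchema built in the question above; on 0.6.0 and later the alias is harmless, since sqlc is kept for backward compatibility):

%pyspark
# Workaround for ZEPPELIN-134: prior to 0.6.0 the %pyspark interpreter
# injects only "sqlc", so alias it to the documented name first.
sqlContext = sqlc

# bank and bankSchema are built exactly as in the question above.
bankdf = sqlContext.createDataFrame(bank, bankSchema)
bankdf.registerTempTable("bank")  # registerAsTable is the deprecated name

With the table registered through the interpreter's shared SQLContext, the %sql paragraph should now find it.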