Tags: apache-spark, pyspark

Spark without Hive - cannot read existing table


I have a simple PySpark setup with a local master and no Hive installed.

I create a SparkSession like this:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.legacy.createHiveTableByDefault", False)
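
For what it's worth, the same setting can also be passed while the session is being built; a minimal sketch, assuming no session is already running (otherwise getOrCreate() returns the existing one and builder configs may not take effect):

from pyspark.sql import SparkSession

# Set the config at build time so it is in effect from the start
spark = (
    SparkSession.builder
    .config("spark.sql.legacy.createHiveTableByDefault", "false")
    .getOrCreate()
)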

Next I create a table:

spark.createDataFrame([('Alice', 1)], ['name', 'age']).writeTo("test").create()

This results in a test folder inside spark-warehouse, with a Parquet file in it.
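
Within that same session the table is visible; a quick sanity check (nothing beyond the code above is assumed):

# The catalog of the session that created the table still knows it
print(spark.catalog.listTables())        # includes 'test'
spark.sql("select * from test").show()   # prints the Alice row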

When I later start a new SparkSession in the same way, it does not pick up that folder.

The catalog reports that no tables exist:

spark.catalog.listTables()  # returns []

And

spark.sql("select * from test")

fails with a TABLE_OR_VIEW_NOT_FOUND error.

How can I get the tables loaded into the catalog in a new Spark session?


Solution

  • Thanks to @mazaneicha, who pointed me in the right direction.

    I created a hive-site.xml on Spark's classpath (e.g. in $SPARK_HOME/conf):

    <configuration>
      <property>
        <name>hive.metastore.local</name>
        <value>true</value>
      </property>
    </configuration>
    

    And now I create my Spark session with Hive support; with no external metastore configured, this uses an embedded Derby metastore (a metastore_db folder in the working directory) to persist table metadata across sessions:

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    And that works!
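
    To illustrate, here is a quick end-to-end check along these lines (a sketch; it assumes the new session starts from the same working directory, since the embedded Derby metastore lives in a local metastore_db folder next to spark-warehouse):

    from pyspark.sql import SparkSession

    # Hive support backs the catalog with a persistent metastore,
    # so tables created in earlier sessions are registered here too
    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    print(spark.catalog.listTables())        # now includes 'test'
    spark.sql("select * from test").show()   # reads the Parquet data back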