I have a simple PySpark setup with a local master and no Hive installed.
I create a SparkSession like this:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.legacy.createHiveTableByDefault", False)
Next I create a table:
spark.createDataFrame([('Alice', 1)], ['name', 'age']).writeTo("test").create()
This results in a folder test inside spark-warehouse, with a parquet file in it.
When I later start a new SparkSession in the same way, it does not pick up that folder. The catalog reports that no tables exist:
spark.catalog.listTables()
gives []
and
spark.sql("select * from test")
fails with TABLE_OR_VIEW_NOT_FOUND.
How can I make it so that the tables are loaded into the catalog in a new spark session?
Thanks to @mazaneicha who pointed me in the right direction.
I created a hive-site.xml:
<configuration>
  <property>
    <name>hive.metastore.local</name>
    <value>true</value>
  </property>
</configuration>
Now I create my Spark session with Hive support enabled:
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
And that works!