I have been learning Spark (3.5.0) and I tried out the following exercise:
import pyspark

spark = pyspark.sql.SparkSession \
    .builder \
    .master("local") \
    .appName("hello-spark") \
    .getOrCreate()
spark.catalog.listDatabases()
I see this:

[Database(name='default', catalog='spark_catalog', description='default database', locationUri='file:/Users/samarth/Desktop/projects/groKPy/sparkey/Coursework/spark-warehouse')]
Then I create a database:

spark.sql('create database my_db')

and create a table using:

spark.sql("""create table my_db.fire_service_calls_tbl (CallNumber integer, UnitID string, IncidentNumber integer)
using parquet""")
When I run spark.catalog.listDatabases() again, I see two databases:

[Database(name='default', catalog='spark_catalog', description='default database', locationUri='file:/Users/samarth/Desktop/projects/groKPy/sparkey/Coursework/spark-warehouse'),
Database(name='my_db', catalog='spark_catalog', description='', locationUri='file:/Users/samarth/Desktop/projects/groKPy/sparkey/Coursework/spark-warehouse/my_db.db')]
I also see the new my_db.db folder, containing the tables I created, in the warehouse directory on disk.
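For what it's worth, within that same session everything looks fine; listing the tables in my_db shows the table I just created:

spark.catalog.listTables('my_db')
# returns a list that includes fire_service_calls_tbl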
However, the next day when I start a new Spark session and run spark.catalog.listDatabases(), Spark can NOT find my_db at all! It just shows the default database:

[Database(name='default', catalog='spark_catalog', description='default database', locationUri='file:/Users/samarth/Desktop/projects/groKPy/sparkey/Coursework/spark-warehouse')]
Running spark.catalog.databaseExists('my_db') returns False.
I am sure I am missing some fundamental point about how this works. I was expecting my_db to show up, but it did not.
I've been having the exact same problem. From what I can tell, with the default configuration the spark.catalog
object reads from an in-memory catalog rather than checking what has been persisted to disk, so database and table metadata only lives as long as the session (the data files under spark-warehouse survive, but the metadata about them does not). This obviously isn't what you want. YMMV, but I found that adding some configuration to the session will get the behavior you want. Here's my config:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config("spark.sql.warehouse.dir", "path/to/your/catalog") \
    .enableHiveSupport() \
    .appName("your-app-name") \
    .getOrCreate()
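With that builder, the session uses the Hive catalog implementation instead of the default in-memory one. You can check this yourself; spark.sql.catalogImplementation is a standard Spark config key, though treat the values shown in the comment as something to verify on your own setup:

spark.conf.get("spark.sql.catalogImplementation")
# 'hive' when enableHiveSupport() was used; 'in-memory' by default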
I believe it is necessary to do both of these things: set spark.sql.warehouse.dir and call enableHiveSupport().
Doing this causes a metastore_db directory to be created when you actually create a database. spark.catalog will now read the information in there when you ask it to list databases/tables or check whether a database/table exists.
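Here's a minimal end-to-end sketch of what I mean. The warehouse path, app name, and database name are placeholders I made up, and it assumes both sessions run from the same working directory, since the Derby-backed metastore_db directory is created there by default:

from pyspark.sql import SparkSession

# Session 1: create a database with Hive support enabled
spark = SparkSession.builder \
    .master("local") \
    .appName("persist-demo") \
    .config("spark.sql.warehouse.dir", "path/to/your/catalog") \
    .enableHiveSupport() \
    .getOrCreate()
spark.sql("create database if not exists my_db")
spark.stop()

# Session 2 (e.g. the next day): same config, so the catalog reads
# the metadata back from metastore_db instead of starting empty
spark = SparkSession.builder \
    .master("local") \
    .appName("persist-demo") \
    .config("spark.sql.warehouse.dir", "path/to/your/catalog") \
    .enableHiveSupport() \
    .getOrCreate()
print(spark.catalog.databaseExists("my_db"))  # should now print True
spark.stop()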
One caveat is that I'm doing this on my local computer, not using any cloud or managed services.