I am working on Azure Databricks, with Databricks Runtime version being - 14.3 LTS (includes Apache Spark 3.5.0, Scala 2.12)
. I am facing the following issue.
Suppose I have a view named v1
and a database f1_processed
created from the following command.
CREATE DATABASE IF NOT EXISTS f1_processed
LOCATION "abfss://processed@formula1dl679student.dfs.core.windows.net/"
Then if I try to create a table at a location that already exists using the command below.
CREATE TABLE f1_processed.circuits
AS
SELECT * FROM v1;
I get the following error
[DELTA_CREATE_TABLE_WITH_NON_EMPTY_LOCATION] Cannot create table ('`spark_catalog`.`f1_processed`.`circuits`').
The associated location ('abfss://processed@formula1dl679student.dfs.core.windows.net/circuits')
is not empty and also not a Delta table. SQLSTATE: 42601
However, if I replace CREATE with CREATE OR REPLACE, then the command runs fine. So the following code runs fine.
CREATE OR REPLACE TABLE f1_processed.circuits
AS
SELECT * FROM v1;
The table didn't exist before. So CREATE OR REPLACE
is also basically creating a table. Shouldn't the behaviour be consistent with the CREATE
command?
Also, the table creation is only happening for delta format. If I specify the format to be any other like parquet, then it fails. Checkout this question
Is it a bug? Any help is appreciated
CREATE TABLE
Command: In CREATE TABLE
command, Apache Spark (and by extension, Databricks) expects the location specified for the table to be empty unless the table already exists as a Delta table. This is by design to prevent accidental data loss by overwriting existing data. If the location contains any files—even if a table does not technically exist in the metastore—Spark will throw an error to safeguard against potential unintended overwrites.
CREATE OR REPLACE TABLE
Command: This command is specifically built for Delta tables. Check this out. It is built for cases where there might be existing table at a location and you want to overwrite the table schema (the REPLACE command does that). When you use CREATE OR REPLACE TABLE
, it means that you are aware of the potential for data at the specified location and are intentionally choosing to overwrite it. It does not need to delete the previous data, rather it utilizes the delta files of the existing table to maintain the version history of the table. Also, note that, if there was an existing table at the same location with any format other than delta (like parquet, csv...) you would not have been able to create the delta table there even using this command. This command replaces only delta tables at the specified location.
It might seem inconsistent at first glance, but this behavior is intentional:
CREATE TABLE
: Prevents accidental overwriting of data.CREATE OR REPLACE TABLE
: Allows for idempotent operations, where you might be running scripts that need to ensure a table exists with a certain schema, potentially replacing an old schema or data without manual intervention.Use CREATE TABLE
When Creating New Tables: If you are sure that the location is new or empty, and you want to prevent accidental overwrites, use the CREATE TABLE
command.
Use CREATE OR REPLACE TABLE
for Idempotent Operations: When your scripts or operations need to ensure a table exists with the latest schema or data from a view or another table, and you are okay with replacing any existing data, use CREATE OR REPLACE TABLE
.
Check Before Creating: If you're programmatically managing tables and want to be extra cautious, you can check if a table exists or if a location is empty before deciding which command to use.