I am working on Azure Databricks with Databricks Runtime version 14.3 LTS (includes Apache Spark 3.5.0, Scala 2.12). I am facing the following issue.
Suppose I have a view named v1 and a database f1_processed created with the following command:
CREATE DATABASE IF NOT EXISTS f1_processed
LOCATION "abfss://[email protected]/"
This creates a database in the container named processed. Suppose I already have a folder named circuits in that container.
Now I run the following command to create a managed table in Parquet format from a DataFrame at that location:
circuits_final_df.write.mode("overwrite").format("parquet").saveAsTable("f1_processed.circuits")
It fails with the following error:
SparkRuntimeException: [LOCATION_ALREADY_EXISTS] Cannot name the managed table as
`spark_catalog`.`f1_processed`.`circuits`, as its associated location
'abfss://[email protected]/circuits' already exists.
Please pick a different table name, or remove the existing location first. SQLSTATE: 42710
However, if I try the same thing in Delta format, it runs fine:
circuits_final_df.write.mode("overwrite").format("delta").saveAsTable("f1_processed.circuits")
Also, while creating this Delta table, it doesn't remove any existing files from the folder; it just adds the new files.
I know that a managed table cannot be created if its location is already occupied, but shouldn't the behaviour be the same for all formats? Also, since the result mixes existing and new data in the same folder, this looks like a bug that should not happen. Any help is appreciated.
You are getting LOCATION_ALREADY_EXISTS because the location the managed table would use (abfss://[email protected]/circuits) already exists, either because it is associated with another table or because it already contains data.
For formats other than Delta, a managed table cannot be created at a location that already contains files. If you do not need the existing data at that location, remove it before creating the managed table.
The command below will help you remove the location:
%fs rm -r abfss://[email protected]/circuits
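If you prefer to do this from a Python cell, a rough equivalent (a sketch using the dbutils file system utilities available in Databricks notebooks; the path is taken from your example) is:
# Recursively delete the existing circuits folder so the managed table can claim the location.
# Double-check the path first -- this removes everything underneath it.
dbutils.fs.rm("abfss://[email protected]/circuits", recurse=True)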
Next, call saveAsTable again; because the database was created with an explicit location, the managed table is placed under abfss://[email protected]/circuits automatically:
circuits_final_df.write.mode("overwrite").format("parquet").saveAsTable("f1_processed.circuits")
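As a quick sanity check afterwards (a sketch, assuming the table was created successfully), you can confirm where the managed table lives and that it contains the new data:
# Show table metadata; the Location row should point at the circuits folder.
spark.sql("DESCRIBE EXTENDED f1_processed.circuits").show(truncate=False)
# Count the rows written by saveAsTable.
print(spark.table("f1_processed.circuits").count())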
I have tried the same on my end. As you mentioned, when creating the Delta table it doesn't remove any files from the folder; it just adds the new files. This is because Delta Lake supports features like schema evolution and data versioning by tracking a table's contents in its transaction log (the _delta_log folder) rather than by whatever happens to sit in the directory. When you create a managed table in Delta format with saveAsTable, Delta Lake writes the new data files and the transaction log into the existing directory without removing or modifying the files that were already there; the pre-existing files simply remain in the folder alongside the files that belong to the table.
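If you want to verify this, a small sketch (assuming the Delta table from your example) is to compare what the table tracks against what is physically in the folder:
# DESCRIBE DETAIL reports the files in the current table snapshot,
# i.e. only the files recorded in _delta_log.
spark.sql("DESCRIBE DETAIL f1_processed.circuits").select("location", "numFiles").show(truncate=False)
# Listing the folder shows everything on storage, including the old files
# that were there before the Delta table was created.
for f in dbutils.fs.ls("abfss://[email protected]/circuits"):
    print(f.path)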