I have an external Delta table partitioned by year in my storage. When I try to "mount" the Delta table in Databricks using this code:
CREATE TABLE if not exists profitek.products
USING DELTA
LOCATION 'dbfs:/mnt/curated-l1/profitek/products'
I get this error:
You are trying to create an external table spark_catalog.general.products from dbfs:/mnt/curated-l1/general/products using Databricks Delta, but there is no transaction log present at dbfs:/mnt/curated-l1/general/products/_delta_log. Check the upstream job to make sure that it is writing using format("delta") and that the path is the root of the table.
The _delta_log directory is inside every partition folder:
products
├── 2022-01-01
│   ├── _delta_log
│   ├── part_01.parquet
│   └── part_02.parquet
└── 2022-01-02
    ├── _delta_log
    ├── part_01.parquet
    └── part_02.parquet
So I can mount a PART of the table using something like this:
CREATE TABLE if not exists profitek.products
USING DELTA
LOCATION 'dbfs:/mnt/curated-l1/profitek/products/2022-01-01'
How can I "mount" the full table, with all its partitions?
Your Delta table was not written using Delta's own partitioning mechanism: df.write.format("delta").partitionBy(<date_column_name>).save(path)
Had it been written that way, the directory structure would look like this, with a single _delta_log at the table root:
products
├── _delta_log
│ └── 0000000.json
├── 2022-01-01
│ ├── part_01.parquet
│ └── part_02.parquet
└── 2022-01-02
    ├── part_01.parquet
    └── part_02.parquet
As per the directory structure you provided, it seems the partitions were written explicitly, i.e. each partition was written to its own location as a standalone Delta table, something like:
df_2022_01_01.write.format("delta").save("products/2022-01-01")
df_2022_01_02.write.format("delta").save("products/2022-01-02")
To get all the partitions into one DataFrame, you need to read the individual ones and simply union them (assuming the schemas are the same):
df_2022_01_01 = spark.read.format("delta").load("products/2022-01-01")
df_2022_01_02 = spark.read.format("delta").load("products/2022-01-02")
# then union the two DataFrames
df = df_2022_01_01.union(df_2022_01_02)
I've shown this for two DataFrames only; you can use the same approach to union any number of DataFrames in a for-loop.
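The loop can be sketched like this (a minimal sketch: `partition_paths` and `union_all` are hypothetical helper names, the date list and root path are illustrative, and you need a Spark session with Delta Lake configured to actually run the commented-out reads):

```python
from functools import reduce


def partition_paths(root: str, dates: list) -> list:
    """Build one Delta path per explicitly written partition directory."""
    return [f"{root}/{d}" for d in dates]


def union_all(dfs):
    """Union any number of DataFrames with matching schemas."""
    return reduce(lambda a, b: a.unionByName(b), dfs)


if __name__ == "__main__":
    # Illustrative only: each path below is a standalone Delta table root.
    paths = partition_paths("dbfs:/mnt/curated-l1/profitek/products",
                            ["2022-01-01", "2022-01-02"])
    # dfs = [spark.read.format("delta").load(p) for p in paths]
    # df = union_all(dfs)
    print(paths)
```

`unionByName` is used instead of `union` so that columns are matched by name rather than position, which is safer if the partition files were written at different times.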
Now that you have the complete data in one DataFrame, you can write it back partitioned by Delta for easier access later on:
df.write.format("delta").partitionBy(<date_column_name>).save("/products/complete-delta")
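Once the data is rewritten this way, the _delta_log sits at the table root, so the original CREATE TABLE statement from the question will succeed against the new path. A small sketch (the `/products/complete-delta` location comes from the line above; `create_table_ddl` is a hypothetical helper, and `spark.sql(ddl)` needs a live Spark session):

```python
def create_table_ddl(schema: str, table: str, location: str) -> str:
    """Build the DDL to register an external Delta table at its root path."""
    return (f"CREATE TABLE IF NOT EXISTS {schema}.{table} "
            f"USING DELTA LOCATION '{location}'")


if __name__ == "__main__":
    ddl = create_table_ddl("profitek", "products", "/products/complete-delta")
    # spark.sql(ddl)  # works now: Delta finds _delta_log at the table root
    print(ddl)
```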