databricks, azure-databricks, delta-lake

Mount external and partitioned delta table in Databricks


I have an external delta table partitioned by year in my storage, so when I try to "mount" the delta table in Databricks using this code:

CREATE TABLE if not exists profitek.products
                   USING DELTA                 
                   LOCATION 'dbfs:/mnt/curated-l1/profitek/products'

I get this error:

You are trying to create an external table spark_catalog.general.products from dbfs:/mnt/curated-l1/general/products using Databricks Delta, but there is no transaction log present at dbfs:/mnt/curated-l1/general/products/_delta_log. Check the upstream job to make sure that it is writing using format("delta") and that the path is the root of the table.

The _delta_log directory is inside every partition folder:

products
│
├── 2022-01-01
│   ├── _delta_log
│   ├── part_01.parquet
│   └── part_02.parquet
├── 2022-01-02
    ├── _delta_log
    ├── part_01.parquet
    └── part_02.parquet

So I can mount a PART of the table using something like this:

CREATE TABLE if not exists profitek.products
                   USING DELTA                 
                   LOCATION 'dbfs:/mnt/curated-l1/profitek/products/2022-01-01'

How can I "mount" the full table, with all its partitions?


Solution

  • Your delta table is not partitioned using delta's partitioning capabilities: df.write.format("delta").partitionBy(<date_column_name>).save(path)

    If done in the way described above, the directory structure would look like this:

    products
    ├── _delta_log
    │   └── 0000000.json
    ├── 2022-01-01
    │   ├── part_01.parquet
    │   └── part_02.parquet
    ├── 2022-01-02
        ├── part_01.parquet
        └── part_02.parquet
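
    For example, a write that produces this layout could look like the sketch below. This is only a sketch: df is assumed to already hold the complete dataset, date_col is a placeholder for your real partition column, and the target path is an assumption.

    # Sketch only: `df` holds the complete dataset, `date_col` is an assumed column name.
    # partitionBy creates one sub-directory per distinct value of that column,
    # while the single _delta_log stays at the table root.
    df.write.format("delta") \
        .partitionBy("date_col") \
        .save("dbfs:/mnt/curated-l1/profitek/products-partitioned")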
    

    Judging by the directory structure you provided, it seems the table was partitioned manually, i.e. each partition was explicitly written to its own location as a separate delta table, something like:

    • df_2022_01_01.write.format("delta").save("products/2022-01-01")
    • df_2022_01_02.write.format("delta").save("products/2022-01-02")

    To get all the partitions into one df, you need to read the individual ones and union them (assuming the schemas are the same):

    df_2022_01_01 = spark.read.format("delta").load("products/2022-01-01")
    df_2022_01_02 = spark.read.format("delta").load("products/2022-01-02")
    # then union the two dfs
    df = df_2022_01_01.union(df_2022_01_02)
    

    I've shown only 2 dfs here; for more partitions you can apply the same union in a for-loop, as sketched below.
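
    One possible shape for that loop, assuming the partition folders are named as in your listing and that this runs in a Databricks notebook where spark is already defined (functools.reduce simply chains the unions):

    from functools import reduce

    # Paths of the explicitly written partitions (names assumed from the listing above;
    # in a Databricks notebook they could also be discovered with dbutils.fs.ls)
    partition_paths = [
        "dbfs:/mnt/curated-l1/profitek/products/2022-01-01",
        "dbfs:/mnt/curated-l1/profitek/products/2022-01-02",
    ]

    # Read each partition as its own delta table, then union them all
    dfs = [spark.read.format("delta").load(p) for p in partition_paths]
    df = reduce(lambda a, b: a.union(b), dfs)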

    Now that you have the complete data in df, you can write it back out partitioned with delta for easier access later on: df.write.format("delta").partitionBy(<date_column_name>).save("/products/complete-delta")
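
    Once that consolidated table exists, the CREATE TABLE from your question should work against its root, because the single _delta_log now sits at the top level. A sketch (the LOCATION below is an assumption and must match wherever you saved the consolidated table):

    # Register the consolidated delta table; adjust the LOCATION to the
    # path used in the partitionBy(...).save(...) step above.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS profitek.products
        USING DELTA
        LOCATION 'dbfs:/mnt/curated-l1/profitek/products/complete-delta'
    """)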